
Summit On Self Adapting Software and Algorithms
August 7th and 8th, 2002
University of Tennessee, Claxton Complex, Room 233
Hosted by the Innovative Computing Laboratory

Presentation Transcript


  1. Summit On Self Adapting Software and Algorithms, August 7th and 8th, 2002. University of Tennessee, Claxton Complex, Room 233. Hosted by the Innovative Computing Laboratory

  2. Agenda • Presentations from various groups • These presentations should give all participants the opportunity to get information about the current activities of all groups and about their future plans and visions. • Discussion to reach consensus on common goals. • Formulate common goals and try to agree on ways to reach them. • Discussion of possible funding opportunities

  3. Self Adapting Numerical Software and Update on NetSolve University of Tennessee

  4. Outline • Overview of the SANS-Effort • Review of ATLAS • Sparse Optimization • LAPACK For Clusters (LFC) • Review of NetSolve • Grid enabled numerical libraries

  5. Innovative Computing Laboratory • Numerical Linear Algebra • Heterogeneous Distributed Computing • Software Repositories • Performance Evaluation. Software and ideas have found their way into many areas of Computational Science. Around 40 people at the moment: 16 researchers (Research Assoc/Post-Doc/Research Prof), 15 students (graduate and undergraduate), 8 support staff (secretary, systems, artist), 1 long-term visitor (Japan). Responsible for about $3-4M/year in research funding from NSF, DOE, DOD, etc.

  6. Self-Adapting Numerical Software (SANS) Effort [Figure: tuning system takes in different algorithms, segment sizes, and data structures, and selects the best algorithm, segment size, and data structure] • The complexity of modern processors and clusters makes it difficult to achieve high performance • Hardware, compilers, and software have a large design space with many parameters • Kernels of computation routines • Focus on where the most time is spent • Need for quick/dynamic deployment of optimized routines • Algorithm layout and implementation • Look at the different ways to express the implementation

  7. Software Generation Strategy - ATLAS BLAS • "New" model of high-performance programming where critical code is machine generated using parameter optimization • Parameter study of the hardware • Generate multiple versions of code, with different values of key performance parameters • Run and measure the performance of the various versions • Pick the best and generate the library • Takes ~20 minutes to run; generates Level 1, 2, & 3 BLAS • Designed for RISC/superscalar architectures; needs a reasonable C compiler • Level 1 cache multiply optimizes for: TLB access, L1 cache reuse, FP unit usage, memory fetch, register reuse, loop overhead minimization • Today ATLAS is used within various ASCI and SciDAC activities and by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, …
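As a hedged illustration of the "generate, time, pick best" loop described on this slide (not ATLAS's actual code generator), the sketch below sweeps candidate L1 blocking factors for a matrix multiply and keeps the fastest one:

```c
/* Minimal sketch of an ATLAS-style parameter sweep: run several
   versions of a blocked matrix multiply and keep the fastest.
   Illustration of the idea only, not ATLAS's generator. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512

/* One candidate kernel: (nb x nb) blocked matrix multiply. */
static void mm_blocked(int nb, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += nb)
        for (int kk = 0; kk < N; kk += nb)
            for (int jj = 0; jj < N; jj += nb)
                for (int i = ii; i < ii + nb && i < N; i++)
                    for (int k = kk; k < kk + nb && k < N; k++)
                        for (int j = jj; j < jj + nb && j < N; j++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}

int main(void)
{
    double *A = calloc(N * N, sizeof *A);
    double *B = calloc(N * N, sizeof *B);
    double *C = calloc(N * N, sizeof *C);
    int candidates[] = { 16, 32, 40, 48, 64, 80 };
    int best_nb = 0;
    double best_t = 1e30;

    /* Parameter sweep: time every version, keep the fastest. */
    for (size_t v = 0; v < sizeof candidates / sizeof *candidates; v++) {
        clock_t t0 = clock();
        mm_blocked(candidates[v], A, B, C);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_t) { best_t = t; best_nb = candidates[v]; }
    }
    printf("best blocking factor: %d (%.3f s)\n", best_nb, best_t);
    free(A); free(B); free(C);
    return 0;
}
```

ATLAS explores a much larger space (loop orders, unrollings, fetch patterns) the same way: empirically, at install time.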

  8. Show Matlab example

  9. ATLAS Matrix Multiply [Plot: Intel Pentium 4 at 2.53 GHz; curves for P4 32-bit floating point using SSE2 and P4 64-bit floating point using SSE2] ~$1000 for the system => $0.25/Mflop/s!

  10. Multi-Threaded DGEMM, Intel PIII 550 MHz

  11. Experiments with C, Fortran, and Java for ATLAS (DGEMM kernel)

  12. Recursive Approach for Other Level 3 BLAS [Figure: recursive splitting of a triangular matrix] Recursive TRMM • Recur down to L1 cache block size • Need kernel at bottom of recursion • Use gemm-based kernel for portability
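A hedged sketch of this recursive scheme for TRMM (B := L·B, L lower triangular), recurring down to a block size and falling back on a simple gemm-style kernel; for brevity L and B are assumed to share the row stride ld, and NB stands in for the L1 block size:

```c
/* Recursive TRMM sketch: B := L*B with L lower triangular.
   ATLAS's real kernels are machine generated; this only shows the
   recursion structure. */
#define NB 64                  /* stand-in for the L1 cache block size */

/* C := C + A*B, A is (m x k), B is (k x n); shared row stride ld. */
static void gemm_kernel(int m, int n, int k,
                        const double *A, const double *B, double *C, int ld)
{
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                C[i * ld + j] += A[i * ld + p] * B[p * ld + j];
}

/* In-place base case: processing rows bottom-up means row i only
   reads rows k <= i, which have not been overwritten yet. */
static void trmm_base(int n, int m, const double *L, double *B, int ld)
{
    for (int i = n - 1; i >= 0; i--)
        for (int j = 0; j < m; j++) {
            double s = 0.0;
            for (int k = 0; k <= i; k++)
                s += L[i * ld + k] * B[k * ld + j];
            B[i * ld + j] = s;
        }
}

/* Split L = [L11 0; L21 L22]: B2 := L22*B2 + L21*B1, then B1 := L11*B1. */
void trmm_rec(int n, int m, const double *L, double *B, int ld)
{
    if (n <= NB) { trmm_base(n, m, L, B, ld); return; }
    int n1 = n / 2, n2 = n - n1;
    trmm_rec(n2, m, L + n1 * ld + n1, B + n1 * ld, ld);      /* B2 := L22*B2 */
    gemm_kernel(n2, m, n1, L + n1 * ld, B, B + n1 * ld, ld); /* B2 += L21*B1 */
    trmm_rec(n1, m, L, B, ld);                               /* B1 := L11*B1 */
}
```

The point of the gemm-based bottom kernel is portability: only one kernel has to be tuned per machine.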

  13. Intel PIII 933 MHz, MKL 5.0 vs ATLAS 3.2.0, using Windows 2000 • ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.

  14. Machine-Assisted Application Development and Adaptation • Communication libraries • Optimize for the specifics of one's configuration. • Algorithm layout and implementation • Look at the different ways to express the implementation

  15. Work in Progress: ATLAS-like Approach Applied to Broadcast (PII 8-way cluster with 100 Mb/s switched network) [Algorithms compared: sequential, binary, binomial, ring]

      Message size   Optimal algorithm   Buffer size
      8 B            binomial            8 B
      16 B           binomial            16 B
      32 B           binary              32 B
      64 B           binomial            64 B
      128 B          binomial            128 B
      256 B          binomial            256 B
      512 B          binomial            512 B
      1 KB           sequential          1 KB
      2 KB           binary              2 KB
      4 KB           binary              2 KB
      8 KB           binary              2 KB
      16 KB          binary              4 KB
      32 KB          binary              4 KB
      64 KB          ring                4 KB
      128 KB         ring                4 KB
      256 KB         ring                4 KB
      512 KB         ring                4 KB
      1 MB           binary              4 KB
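A table like this is then consulted at run time. The sketch below shows one way such table-driven selection could look; the thresholds are condensed from the measurements above and the enum/function names are illustrative, not a real MPI implementation's internals:

```c
/* Hypothetical table-driven broadcast selection, in the spirit of
   the tuned table above.  Names and thresholds are illustrative. */
#include <stddef.h>

typedef enum { SEQUENTIAL, BINARY, BINOMIAL, RING } bcast_alg_t;

struct tuned_entry { size_t max_bytes; bcast_alg_t alg; size_t buf_bytes; };

/* Thresholds condensed from the measured table. */
static const struct tuned_entry tuned[] = {
    {     512, BINOMIAL,    512 },  /* small messages: binomial tree */
    {    1024, SEQUENTIAL, 1024 },
    {   32768, BINARY,     4096 },  /* mid-size: binary tree         */
    {  524288, RING,       4096 },  /* large: ring (pipelined)       */
    { (size_t)-1, BINARY,  4096 }   /* fallback                      */
};

static bcast_alg_t choose_bcast(size_t msg_bytes, size_t *buf_bytes)
{
    size_t i = 0;
    while (msg_bytes > tuned[i].max_bytes)
        i++;
    *buf_bytes = tuned[i].buf_bytes;
    return tuned[i].alg;
}
```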

  16. Reformulating/Rearranging/Reuse • Example is the reduction to narrow band form for the SVD • Fetch each entry of A once • Restructure and combine operations • Results in a speedup of > 30%

  17. CG Variants by Dynamic Selection at Run Time • Variants combine inner products to reduce the communication bottleneck at the expense of more scalar operations • Same number of iterations; no advantage on a sequential processor • With a large number of processors and a high-latency network it may be advantageous • Improvements can range from 15% to 50% depending on size.
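The communication trick behind such variants can be seen in a few lines: instead of two synchronizing reductions per iteration, both partial inner products travel in a single MPI_Allreduce. A minimal sketch (dot_local is an assumed local helper, not part of MPI):

```c
/* Fuse two CG inner products into one collective: one latency hit
   instead of two per iteration. */
#include <mpi.h>

static double dot_local(const double *x, const double *y, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

void fused_inner_products(const double *r, const double *z, int n_local,
                          MPI_Comm comm, double *rr, double *rz)
{
    double local[2], global[2];
    local[0] = dot_local(r, r, n_local);   /* (r,r) partial sum */
    local[1] = dot_local(r, z, n_local);   /* (r,z) partial sum */
    MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, comm);
    *rr = global[0];
    *rz = global[1];
}
```

The extra scalar work comes from rearranging the recurrences so that both products are available at the same point in the iteration.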


  19. Solving Large Sparse Non-Symmetric Systems of Linear Equations Using BiCG-Stab • Example of optimizations • Combining 2 vector ops into 1 loop • Simplifying indexing • Removal of “if test” within loop
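To make the first two bullets concrete, here is a hedged before/after sketch using the two BiCG-Stab vector updates x := x + αp + ωs and r := s − ωt (variable names are illustrative):

```c
/* Before: two separate passes over the vectors. */
void update_unfused(int n, double *x, double *r, const double *p,
                    const double *s, const double *t,
                    double alpha, double omega)
{
    for (int i = 0; i < n; i++)
        x[i] += alpha * p[i] + omega * s[i];
    for (int i = 0; i < n; i++)
        r[i] = s[i] - omega * t[i];
}

/* After: one fused loop -- one pass over memory, s[i] loaded once,
   half the loop overhead. */
void update_fused(int n, double *x, double *r, const double *p,
                  const double *s, const double *t,
                  double alpha, double omega)
{
    for (int i = 0; i < n; i++) {
        double si = s[i];
        x[i] += alpha * p[i] + omega * si;
        r[i] = si - omega * t[i];
    }
}
```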

  20. Optimization of BiCG-Stab 10% - 20% Improvement

  21. Split ADI Method • Originally from a radiation transport problem • Solution of medium-size dense blocks • Well-conditioned => iterative method • Write A = D - E = D(I - N) with N = D^-1 E, and take M^-1 = (I + N) D^-1 • Left-preconditioning: M^-1 A = (I + N)(I - N) = I - N^2 • D^-1 is cheap; the question is how efficiently y = N^2 x can be formed • Can we beat 2 matrix-vector operations?

  22. Performance of L1-Cache A^2 x Kernel • Avoids 2 calls to the MV routine • Do blocking to capture peak • 10% improvement from this simple optimization

  23. ScaLAPACK • ScaLAPACK is a portable distributed memory numerical library • Complete numerical library for dense matrix computations • Designed for distributed parallel computing (MPP & Clusters) using MPI • One of the first math software packages to do this • Numerical software that will work on a heterogeneous platform • Funding from DOE, NSF, and DARPA • In use today by IBM, HP-Convex, Fujitsu, NEC, Sun, SGI, Cray, NAG, IMSL, … • Tailor performance & provide support

  24. To Use ScaLAPACK a User Must: • Download the package and auxiliary packages (like PBLAS, BLAS, BLACS, & MPI) to the machines • Write an SPMD program which • Sets up the logical 2-D process grid • Places the data on the logical process grid • Calls the numerical library routine in an SPMD fashion • Collects the solution after the library routine finishes • The user must allocate the processors and decide the number of processes the application will run on • The user must start the application: "mpirun -np N user_app" • Note: the number of processors is fixed by the user before the run; if the problem size changes dynamically … • Upon completion, return the processors to the pool of resources
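A hedged sketch of that SPMD skeleton, solving Ax = b with pdgesv. It assumes the common C bindings to BLACS (Cblacs_*) and the Fortran entry point pdgesv_; descriptor setup and data distribution are elided, so this shows the structure the slide describes rather than a complete program:

```c
#include <mpi.h>

extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ctxt, int what, int *val);
extern void Cblacs_gridinit(int *ctxt, char *order, int nprow, int npcol);
extern void Cblacs_gridexit(int ctxt);
extern void pdgesv_(int *n, int *nrhs, double *a, int *ia, int *ja,
                    int *desca, int *ipiv, double *b, int *ib, int *jb,
                    int *descb, int *info);

int main(int argc, char **argv)
{
    int iam, nprocs, ctxt;
    int nprow = 2, npcol = 2;           /* logical 2-D process grid */

    MPI_Init(&argc, &argv);
    Cblacs_pinfo(&iam, &nprocs);
    Cblacs_get(-1, 0, &ctxt);
    Cblacs_gridinit(&ctxt, "Row", nprow, npcol);  /* step 1: grid   */

    /* step 2: allocate the local pieces of A and b, fill the array
       descriptors desca/descb (descinit_), and place the data on
       the grid -- elided here.                                     */

    /* step 3: every process calls the routine in SPMD fashion:
       pdgesv_(&n, &nrhs, a, &ia, &ja, desca, ipiv,
               b, &ib, &jb, descb, &info);                          */

    /* step 4: collect the distributed solution x -- elided.        */

    Cblacs_gridexit(ctxt);
    MPI_Finalize();
    return 0;
}
```

Launched as, e.g., "mpirun -np 4 ./user_app"; note that every one of these steps is the user's responsibility, which is exactly what LFC sets out to automate.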

  25. ScaLAPACK Cluster Enabled • Implement a version of a ScaLAPACK library routine that runs on clusters. • Make use of resources at the user’s disposal • Provide the best time to solution • Proceed without the user’s involvement • Make as few changes as possible to the numerical software.

  26. ScaLAPACK Performance Model • Total number of floating-point operations per processor • Total number of data items communicated per processor • Total number of messages • Time per floating-point operation • Time per data item communicated • Time per message • These six quantities combine into the execution-time model below.
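Writing the three counts as C_f, C_v, C_m and the three per-unit costs as t_f, t_v, t_m, the per-processor execution time is modeled as:

```latex
T(n, p) = C_f \, t_f + C_v \, t_v + C_m \, t_m
```

Given measured values of t_f, t_v, and t_m for a set of resources, this model is what the middleware can minimize when selecting processors.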

  27. LAPACK For Clusters • Developing middleware which couples cluster system information with the specifics of a user problem to launch cluster-based applications on the "best" set of resources available • Using ScaLAPACK as the prototype software

  28. Big Picture… [Diagram: the user has a problem to solve (e.g. Ax = b) and supplies natural data (A, b); middleware turns it into structured data (A', b') for an application library (e.g. LAPACK, ScaLAPACK, PETSc, …); the structured answer (x') is mapped back to the natural answer (x)]

  29. Numerical Libraries for Clusters [Diagram: the user stages the data (A, b) to disk]

  30. Numerical Libraries for Clusters [Diagram, continued: the library middleware picks up the user's data (A, b)]

  31. Numerical Libraries for Clusters [Diagram, continued: the middleware consults NWS, Autopilot, and MDS, minimizes a time function, and performs resource selection]

  32. Numerical Libraries for Clusters [Diagram, continued: the computation runs on the selected process grid (0,0), (0,1), …] • Can use Grid infrastructure, i.e. Globus/NWS, but doesn't have to.

  33. Resource Selector • Uses information on bandwidth, latency, load, memory, and CPU performance • 2 matrices (bandwidth, latency) and 3 arrays (load, CPU, memory available) • Generated dynamically by the library routine [Plots: CPU performance, memory, bandwidth, load, latency]
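A compact way to picture the selector's inputs; the names and layout below are an assumption made for illustration, not the LFC data structures:

```c
/* Hypothetical snapshot of what the resource selector collects:
   two pairwise matrices plus three per-node arrays.  Names and
   units are illustrative, not the LFC API. */
typedef struct {
    int     nnodes;
    double *bw;    /* nnodes x nnodes bandwidth matrix (Mbit/s) */
    double *lat;   /* nnodes x nnodes latency matrix (us)       */
    double *load;  /* per-node load average                     */
    double *cpu;   /* per-node CPU performance (Mflop/s)        */
    double *mem;   /* per-node available memory (MB)            */
} resource_snapshot;
```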

  34. LAPACK For Clusters (LFC) • LFC will automate many of the decisions in the cluster environment to provide the best time to solution • Adaptivity to the dynamic environment • As the complexity of clusters and grids increases, we need to develop strategies for self-adaptability • Handcrafted development leading to an automated design • Developing a basic infrastructure for computational science applications and software in the cluster and grid environment • Lack of tools is hampering development today • Plan to do LU, Cholesky, QR, symmetric eigenvalue, and nonsymmetric eigenvalue

  35. Future SANS Effort • Intelligent Component (IC) • Automates method selection based on data, algorithm, and system attributes • System component • Provides intelligent management of and access to the computational grid • History database • Records relevant info generated by the IC and maintains past performance data

  36. NetSolve to Determine the Best Algorithm/Software (A bit in the future) • x = netsolve('linearsolve', A, b) • NetSolve examines the user's data together with the resources available (in the grid sense) and dynamically makes decisions on the best time to solution • Decision based on size of problem, hardware/software, network connection, etc. • Method and placement [Decision tree: matrix type -> dense or sparse; dense -> direct (symmetric, general, rectangular); sparse -> direct or iterative (each symmetric or general)]

  37. Decisions • What is the matrix shape? • Rectangular or square • How dense is the matrix? • Checks the percent of non-zeros • Is the matrix banded? • What is the diagonal structure? • Zeros on diagonal • Alternating signs • Are there dense blocks? • Special structure • Zero 2x2 blocks
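A few of these structural checks are cheap to sketch for a dense m x n matrix stored row-major; the thresholds a real system would apply are omitted, and the helper names below are made up for the example:

```c
#include <stddef.h>
#include <stdlib.h>

/* Fraction of nonzeros -- answers "how dense is the matrix?" */
double density(const double *A, int m, int n)
{
    size_t nnz = 0;
    for (size_t i = 0; i < (size_t)m * n; i++)
        if (A[i] != 0.0) nnz++;
    return (double)nnz / ((double)m * n);
}

/* Semi-bandwidth: largest |i - j| with A[i][j] != 0 ("is it banded?"). */
int bandwidth(const double *A, int n)
{
    int bw = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (A[i * n + j] != 0.0 && abs(i - j) > bw)
                bw = abs(i - j);
    return bw;
}

/* Zeros on the diagonal rule out some factorizations/preconditioners. */
int has_zero_diagonal(const double *A, int n)
{
    for (int i = 0; i < n; i++)
        if (A[i * n + i] == 0.0) return 1;
    return 0;
}
```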

  38. Sparse Solvers • Direct or iterative? • Size of the problem • Fill-in • Memory available? • Numerical properties • Start iterative method and monitor progress • Estimate number of iterations

  39. Heuristics • If sparse and symmetric • Run CG and see if breakdown occurs • Try on A + εI, where ε is some fraction of what's needed to make the matrix diagonally dominant (QMR or GMRES might work) • For the nonsymmetric case, adaptively assess the spectrum, monitor residual and Ritz values

  40. Structure • Domain-partitioning methods • Additive or Multiplicative Schwarz, or Schur-complement Domain Decomposition. • Algebraic MultiGrid methods can also be used if the matrix has a discernible structure. • Close enough to elliptic • For more general matrices, we can use an ILU preconditioner. • One variant that is reasonably robust is the Manteuffel ILU preconditioner

  41. Decision Tree [Figure: the full algorithm-selection decision tree]

  42. NetSolve - Grid Enabled Server • NetSolve is an example of a Grid-based hardware/software server • Based on a Remote Procedure Call model, but with … • resource discovery, dynamic problem-solving capabilities, load balancing, fault tolerance, asynchronicity, security, … • Ease of use paramount • Other examples are NEOS from Argonne and Ninf from Japan.

  43. NetSolve: The Big Picture [Diagram: clients (Matlab, Octave, Scilab, Mathematica, C, Fortran), agent(s) with a schedule database, servers S1-S4, and an IBP depot] No knowledge of the grid required; RPC-like.

  44. NetSolve: The Big Picture [Diagram, continued: the client hands its data (A, B) to the system] No knowledge of the grid required; RPC-like.

  45. NetSolve: The Big Picture [Diagram, continued: the agent passes a handle back to the client] No knowledge of the grid required; RPC-like.

  46. NetSolve: The Big Picture [Diagram, continued: the agent replies "Request S2!"; the client sends A, B, the operation, and the handle to server S2, which computes Op(C, A, B) and returns the answer C] No knowledge of the grid required; RPC-like.

  47. Hiding the Parallel Processing • The user may be unaware of the parallel processing • NetSolve takes care of starting the message-passing system, distributing the data, and returning the results.
