
Parallel Maximum Likelihood Fitting Using MPI



  1. Parallel Maximum Likelihood Fitting Using MPI Brian Meadows, U. Cincinnati and David Aston, SLAC

  2. What is MPI ?
  • “Message Passing Interface” - a standard defined for passing messages between processors (CPUs)
  • Communications interface to Fortran, C or C++ (maybe others)
  • Definitions apply across different platforms (can mix Unix, Mac, etc.)
  • Parallelization of code is explicit - recognized and defined by users
  • Memory can be
    • shared between CPUs,
    • distributed among CPUs, OR
    • a hybrid of these
  • The number of CPUs allowed is not pre-defined, but is fixed in any one application
  • The required number of CPUs is defined by the user at job startup and does not undergo runtime optimization.
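As a minimal sketch of these ideas (not part of the original slides; the program name HELLO_MPI is illustrative), the Fortran fragment below starts MPI, asks which job it is (its rank) and how many jobs were started, prints both, and shuts MPI down:

      Program HELLO_MPI
C
C- Minimal MPI example: each job finds its own ID (rank) and the
C- total number of jobs started by mpirun, then prints them.
C
      Implicit none
      include 'mpif.h'
      Integer MPIerr, MPIrank, MPIprocs
C
      call MPI_INIT(MPIerr)                                 ! Initialize MPI
      call MPI_COMM_RANK(MPI_COMM_WORLD, MPIrank,  MPIerr)  ! Which one am I ?
      call MPI_COMM_SIZE(MPI_COMM_WORLD, MPIprocs, MPIerr)  ! How many of us ?
C
      print *, 'Job ', MPIrank, ' of ', MPIprocs, ' is running'
C
      call MPI_FINALIZE(MPIerr)
      End

Started with, for example, mpirun -np 4 hello_mpi, each of the 4 jobs prints its own rank (0 to 3).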

  3. How Efficient is MPI ?
  • The best you can do is speed up a job by a factor equal to the number of physical CPUs involved.
  • Factors limiting this:
    • poor synchronization between CPUs due to unbalanced loads
    • sections of code that cannot be vectorized
    • signalling delays.
  • NOTE - it is possible to request more CPUs than physically exist
    • this will produce some overhead in processing, though!

  4. Running MPI
  • Run the program with mpirun -np N <job>, which submits N identical jobs to the system (you can also specify IP addresses for distributed CPUs).
  • The OS in each machine allocates physical CPUs dynamically as usual.
  • Each job
    • is given an ID (0 to N-1) which it can access
    • needs to be in an identical environment to the others.
  • Users can use this ID to label a main job (“JOB0” for example) and the remaining “satellite” jobs.

  5. Fitting with MPI
  • For a fit, each job should be structured to be able to run the parts it is required to do:
    • any set-up (read in events, etc.)
    • the parts that are vectorized (e.g. its group of events or parameters).
  • One job needs to be identified as the main one (“JOB0”) and must do everything, farming out groups of events or parameters to the others.
  • Each satellite job must send results (“signals”) back to JOB0 when done with its group, and await a return “signal” from JOB0 before it starts again.
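A self-contained toy version of this structure is sketched below (our illustration, not the authors' fit code: the program name, the dummy arithmetic standing in for the likelihood, and the fixed three passes are all assumptions). JOB0 broadcasts each new parameter value, every job works on its own share of the “events”, and the partial results are gathered back on JOB0; this is closer to alternative A of slide 7, while the parameter-based code actually used (alternative B) appears on slides 14-15.

      Program FIT_STRUCTURE
C
C- Toy sketch of the JOB0 / satellite structure: JOB0 chooses the
C- next parameter value (in a real fit MINUIT would do this) and
C- broadcasts it; every job sums over its own share of the work;
C- the partial sums are combined back on JOB0.
C
      Implicit none
      include 'mpif.h'
      Integer MPIerr, MPIrank, MPIprocs, more, iter, i
      Double precision X, partial, total
C
      call MPI_INIT(MPIerr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, MPIrank,  MPIerr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, MPIprocs, MPIerr)
C
      iter = 0
   10 Continue
C     JOB0 decides whether to carry on and with what parameters
      If (MPIrank .eq. 0) Then
         iter = iter + 1
         more = 1
         If (iter .gt. 3) more = 0
         X = dble(iter)
      End If
      call MPI_BCAST(more, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, MPIerr)
      If (more .eq. 0) GO TO 90
      call MPI_BCAST(X, 1, MPI_DOUBLE_PRECISION, 0,
     A               MPI_COMM_WORLD, MPIerr)
C
C     Each job evaluates its own group of "events" (dummy sum here)
      partial = 0.0d0
      Do 20 i = MPIrank+1, 1000, MPIprocs
         partial = partial + X*dble(i)
   20 Continue
C
C     Send the results back to JOB0 and go round again
      call MPI_REDUCE(partial, total, 1, MPI_DOUBLE_PRECISION,
     A                MPI_SUM, 0, MPI_COMM_WORLD, MPIerr)
      If (MPIrank .eq. 0) print *, 'pass', iter, '  total =', total
      GO TO 10
C
   90 call MPI_FINALIZE(MPIerr)
      End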

  6. How MPI Runs
  [Diagram: “Scatter-Gather” running - mpirun starts CPU 0, CPU 1, CPU 2, … ; at “Start” all CPUs run, at “Scatter” the work is farmed out, the satellite CPUs wait once their share is done, and at “Gather” the results are collected back on CPU 0.]

  7. Ways to Implement MPI in Maximum Likelihood Fitting
  Two main alternatives:
  A. Vectorize FCN - which evaluates f(x) = -2 Σ ln W
  B. Vectorize MINUIT (which finds the best parameters)
  • Alternative A has been used in previous BaBar analyses
    • e.g. the mixing analysis of D0 → K+π-
  • Alternative B is reported here (done by DYAEB and tested by BTM)
  • An advantage of B over A is that the vectorization is implemented outside the user’s code.
  • Vectorizing FCN may not be efficient if an integral is computed on each call, unless the integral evaluation is also vectorized.

  8. Vectorize FCN
  • The log-likelihood always includes a sum S = Σi ln Wi over i = 1 … n, where n = number of events (or bins).
  • Vectorize the computation of the sum in 2 steps (“Scatter-Gather”):
    • Scatter: divide the events (or bins) among the CPUs. CPU j computes its partial sum Sj = Σ ln Wi over its own subset of events.
    • Gather: re-combine the N CPUs: S = S1 + S2 + … + SN.

  9. Vectorize FCN
  • The computation of the integral also needs to be vectorized:
    • this is usually a sum (over bins), so it can be done in a similar way.
  • Main advantage of this method:
    • assuming function evaluation dominates the CPU cycles, the gain coefficient is close to 1.0, independent of the number of CPUs or parameters.
  • Main disadvantage:
    • it requires that the user code each application appropriately.

  10. Vectorize MINUIT
  • Several algorithms are available in MINUIT:
    • MIGRAD (variable metric algorithm) - finds the local minimum and the error matrix at that point
    • SIMPLEX (Nelder-Mead method) - linear programming method
    • SEEK (MC method) - random search, virtually obsolete
  • Most often used is MIGRAD - so focus on that.
  • It is easily vectorized, but the results may not be at the highest efficiency.

  11. One iteration in MIGRAD
  • Compute the function and gradient g at the current position x.
  • Use the current curvature metric (the error-matrix estimate V) to compute the step: Δx = -V g.
  • Take a (large) step along this direction.
  • Compute the function and gradient there, then interpolate (cubically) back to the local minimum along the step (this may need to be iterated).
  • If satisfactory, improve the curvature metric V.

  12. One iteration in MIGRAD
  • Most of the time is spent in computing the gradient.
  • Numerical evaluation of the gradient requires 2 FCN calls per parameter: gi ≈ [ F(x + di ei) - F(x - di ei) ] / (2 di), where ei is the unit vector for parameter i and di its step size.
  • Vectorize this computation in two steps (“Scatter-Gather”):
    • Scatter: divide the parameters xi among the CPUs; each CPU computes the gradient components gi for its own subset of parameters.
    • Gather: re-combine the N CPUs into the full gradient vector.
  (A small serial version of this derivative loop is sketched below.)
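To make the “2 FCN calls per parameter” concrete, here is a small self-contained serial version of the derivative loop (our illustration: the quadratic FCN, the fixed step D and the simplified FCN argument list are stand-ins, not MINUIT's actual code). Slide 15 shows how the loop over i is then split across processes.

      Program GRADIENT_DEMO
C
C- Serial central-difference gradient: two FCN calls per parameter.
C- The quadratic FCN below is a stand-in for the real likelihood.
C
      Implicit none
      Integer NPAR, i
      Parameter (NPAR=3)
      Double precision X(NPAR), G(NPAR), D, xsave, fplus, fminus
      Data X / 1.0d0, 2.0d0, 3.0d0 /
C
      D = 1.0d-4
      Do 60 i = 1, NPAR
         xsave = X(i)
         X(i)  = xsave + D
         call FCN(NPAR, X, fplus)               ! 1st call for parameter i
         X(i)  = xsave - D
         call FCN(NPAR, X, fminus)              ! 2nd call for parameter i
         G(i)  = (fplus - fminus)/(2.0d0*D)     ! central difference
         X(i)  = xsave
   60 Continue
      print *, 'gradient =', G
      End
C
      Subroutine FCN(n, X, f)
C- Dummy "likelihood": f = sum of X(i)**2, so the gradient is 2*X
      Implicit none
      Integer n, i
      Double precision X(n), f
      f = 0.0d0
      Do 10 i = 1, n
         f = f + X(i)*X(i)
   10 Continue
      End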

  13. Vectorize MIGRAD
  • This is less efficient the smaller the number of parameters.
  • Works well if NPAR is large compared with the number of CPUs:
      Gain ~ NCPU*(NPAR + 2) / (NPAR + 2*NCPU)
      Max. Gain = NCPU
  • For 105 parameters a factor of 3.7 was gained with 4 CPUs.
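One way to read that gain formula (our interpretation; the slide gives no derivation): per MIGRAD iteration roughly NPAR units of FCN work (the gradient of slide 12) are shared among the CPUs, while about 2 units (the function evaluations along the step in slide 11) remain on every CPU, so Gain ≈ (NPAR + 2) / (NPAR/NCPU + 2) = NCPU*(NPAR + 2) / (NPAR + 2*NCPU). As a numerical check, NPAR = 105 and NCPU = 4 give 4*107/113 ≈ 3.8, consistent with the quoted factor of 3.7.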

  14. Initialization of MPI

      Program FIT_Kpipi
C
C- Maximum likelihood fit of D -> Kpipi Dalitz plot.
C
      Implicit none
      Save
      external fcn
      include 'mpif.h'
      Integer MPIerr, MPIrank, MPIprocs, MPIflag
C
      MPIerr  = 0
      MPIrank = 0
      MPIprocs= 1
      MPIflag = 1
      call MPI_INIT(MPIerr)                                ! Initialize MPI
      call MPI_COMM_RANK(MPI_COMM_WORLD, MPIrank, MPIerr)  ! Which one am I ?
      call MPI_COMM_SIZE(MPI_COMM_WORLD, MPIprocs, MPIerr) ! Get number of CPUs
      … call MINUIT, etc.
      call MPI_FINALIZE(MPIerr)
      End

  15. Use of Scatter-Gather Mechanism in MNDERI (Fortran)

C     Distribute the parameters from proc 0 to everyone
   33 call MPI_BCAST(X, NPAR+1, MPI_DOUBLE_PRECISION, 0,
     A               MPI_COMM_WORLD, MPIerr)
      …
C     Use the scatter-gather mechanism to compute a subset of the
C     derivatives in each process:
      nperproc = (NPAR-1)/MPIprocs + 1
      iproc1   = 1 + nperproc*MPIrank
      iproc2   = MIN(NPAR, iproc1+nperproc-1)
      call MPI_SCATTER(GRD,         nperproc, MPI_DOUBLE_PRECISION,
     A                 GRD(iproc1), nperproc, MPI_DOUBLE_PRECISION,
     A                 0, MPI_COMM_WORLD, MPIerr)
C
C     Loop over this process's variable parameters
      DO 60 i = iproc1, iproc2
         … compute G(I)
   60 Continue
C
C     Wait until everyone is done, collecting all derivatives on proc 0:
      call MPI_GATHER(GRD(iproc1), nperproc, MPI_DOUBLE_PRECISION,
     A                GRD,         nperproc, MPI_DOUBLE_PRECISION,
     A                0, MPI_COMM_WORLD, MPIerr)
C     Everyone but proc 0 goes back to await the next set of parameters
      If (MPIrank .ne. 0) GO TO 33
C     … Continue computation (CPU 0 only)
