
Introduction to Scientific Computing



  1. Introduction to Scientific Computing Doug Sondak sondak@bu.edu Boston University Scientific Computing and Visualization

  2. Outline • Introduction • Software • Parallelization • Hardware

  3. Introduction • What is Scientific Computing? • Need for speed • Need for memory • Simulations tend to grow until they overwhelm available resources • If I can simulate 1000 neurons, wouldn’t it be cool if I could do 2000? 10000? 10^87? • Example – flow over an airplane • It has been estimated that if a teraflop machine were available, it would take about 200,000 years to solve (resolving all scales) • If Homo erectus had had a teraflop machine, we could be getting the result right about now.

  4. Introduction (cont’d) • Optimization • Profile serial (1-processor) code • Tells where most time is consumed • Is there any low-hanging fruit? • Faster algorithm • Optimized library • Wasted operations • Parallelization • Break problem up into chunks • Solve chunks simultaneously on different processors

  5. Software

  6. Compiler • The compiler is your friend (usually) • Optimizers are quite refined • Always try highest level • Usually -O3 • Sometimes -fast, -O5, … • Loads of flags, many for optimization • Good news – many compilers will automatically parallelize for shared-memory systems • Bad news – this usually doesn’t work well

  7. Software • Libraries • Solver is often a major consumer of CPU time • Numerical Recipes is a good book, but many algorithms are not optimal • LAPACK is a good resource • Libraries are often available that have been optimized for the local architecture • Disadvantage – not portable
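To make the library point concrete, here is a minimal sketch (not from the original slides) of solving a small linear system with LAPACK's dgesv routine through the LAPACKE C interface. It assumes LAPACKE is installed; link flags vary by system.

    /* Minimal sketch: solve A x = b with LAPACK's dgesv via LAPACKE.
       Assumes LAPACKE is available (e.g. link with -llapacke -llapack). */
    #include <stdio.h>
    #include <lapacke.h>

    int main(void)
    {
        /* 2x2 system, row-major storage:  [3 1; 1 2] x = [9; 8] */
        double A[4] = { 3.0, 1.0,
                        1.0, 2.0 };
        double b[2] = { 9.0, 8.0 };
        lapack_int ipiv[2];

        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, b, 1);
        if (info != 0) {
            printf("dgesv failed, info = %d\n", (int) info);
            return 1;
        }
        printf("x = %f %f\n", b[0], b[1]);   /* expect x = 2, 3 */
        return 0;
    }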

  8. Parallelization

  9. Parallelization • Divide and conquer! • divide operations among many processors • perform operations simultaneously • if a serial run takes 10 hours and we hit the problem with 5000 processors, it should take about 7 seconds to complete (36,000 s / 5000 = 7.2 s), right? • not so easy, of course

  10. Parallelization (cont’d) • problem – some calculations depend upon previous calculations • can’t be performed simultaneously • sometimes tied to the physics of the problem, e.g., time evolution of a system • want to maximize amount of parallel code • occasionally easy • usually requires some work

  11. Parallelization (3) • method used for parallelization may depend on hardware • distributed memory • each processor has own address space • if one processor needs data from another processor, must be explicitly passed • shared memory • common address space • no message passing required

  12. Parallelization (4) [Diagram: three architectures – shared memory (proc 0–3 all attached to one shared mem), distributed memory (proc 0–3 each with its own mem 0–3), and mixed memory (two 2-processor nodes, each pair sharing mem 0 or mem 1, with message passing between nodes)]

  13. Parallelization (5) • MPI • for both distributed and shared memory • portable • freely downloadable • OpenMP • shared memory only • must be supported by compiler (most do) • usually easier than MPI • can be implemented incrementally

  14. MPI • Computational domain is typically decomposed into regions • One region assigned to each processor • Separate copy of program runs on each processor
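A minimal sketch of this pattern, assuming a 1-D domain split evenly among ranks; the problem size and variable names are illustrative, not from the course code.

    /* Minimal MPI sketch: every rank runs this same program and
       works on its own contiguous chunk of a global 1-D domain.
       Compile with an MPI wrapper, e.g. mpicc. */
    #include <mpi.h>
    #include <stdio.h>

    #define NGLOBAL 1000   /* illustrative global problem size */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* each rank owns indices [start, end) of the global domain */
        int chunk = NGLOBAL / nprocs;
        int start = rank * chunk;
        int end   = (rank == nprocs - 1) ? NGLOBAL : start + chunk;

        printf("rank %d of %d handles points %d to %d\n",
               rank, nprocs, start, end - 1);

        MPI_Finalize();
        return 0;
    }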

  15. MPI • Discretized domain to solve flow over airfoil • System of coupled PDEs solved at each point

  16. MPI • Decomposed domain for 4 processors

  17. MPI • Since points depend on adjacent points, must transfer information after each iteration • This is done with explicit calls in the source code
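A hedged sketch of what such explicit calls often look like in practice: a 1-D halo exchange with MPI_Sendrecv performed after each iteration. The ghost-cell layout and names are assumptions for illustration only.

    /* Illustrative halo exchange for a 1-D decomposition:
       each rank sends its edge values to its neighbors and receives
       the neighbors' edge values into ghost cells u[0] and u[n+1]. */
    #include <mpi.h>

    void exchange_halo(double *u, int n, int rank, int nprocs)
    {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        /* send rightmost interior point right, receive left ghost cell */
        MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 0,
                     &u[0], 1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* send leftmost interior point left, receive right ghost cell */
        MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  1,
                     &u[n + 1], 1, MPI_DOUBLE, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }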

  18. MPI • Diminishing returns • Sending messages can get expensive • Want to maximize ratio of computation to communication • Parallel speedup S = T(1) / T(n); parallel efficiency E = S / n = T(1) / (n T(n)), where T(n) = run time on n processors
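For example (made-up numbers): if a serial run takes T(1) = 100 s and the same problem takes T(8) = 20 s on 8 processors, the speedup is S = 100/20 = 5 and the parallel efficiency is E = 5/8 ≈ 0.63, i.e. about 63%.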

  19. Speedup

  20. Parallel Efficiency

  21. OpenMP for(i=0; i<N; i++){ do lots of stuff } • Usually loop-level parallelization • An OpenMP directive is placed in the source code before the loop • Assigns subset of loop indices to each processor • No message passing since each processor can “see” the whole domain
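A minimal sketch of the loop-level directive described above, with a simple array-filling loop standing in for the "lots of stuff"; compile with the compiler's OpenMP flag (e.g. -fopenmp for gcc).

    /* Loop-level parallelization with OpenMP: the directive splits the
       loop iterations among the threads; no message passing is needed. */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static double a[N];

        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * i;          /* placeholder for "lots of stuff" */
        }

        printf("a[N-1] = %f (ran with up to %d threads)\n",
               a[N - 1], omp_get_max_threads());
        return 0;
    }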

  22. OpenMP • Example of how to do it wrong: for(i = 0; i < 7; i++) a[i] = 1; for(i = 1; i < 7; i++) a[i] = 2*a[i-1]; • Parallelize the second loop on 2 processors, e.g. i = 1–3 on proc. 0 and i = 4–6 on proc. 1 • Each a[i] depends on a[i-1], and OpenMP can’t guarantee the order of operations across processors, so the result is wrong
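A hedged, compilable version of the same pitfall; the pragma mentioned in the comment is the mistake being illustrated, not a recommendation.

    /* Loop-carried dependence: each a[i] needs the a[i-1] computed on
       the previous iteration, so the second loop must NOT be parallelized. */
    #include <stdio.h>

    int main(void)
    {
        int a[7];

        for (int i = 0; i < 7; i++)      /* independent iterations: safe */
            a[i] = 1;

        /* WRONG if parallelized, e.g. with "#pragma omp parallel for":
           a thread starting at i = 4 would read a[3] before it is set. */
        for (int i = 1; i < 7; i++)
            a[i] = 2 * a[i - 1];

        for (int i = 0; i < 7; i++)
            printf("a[%d] = %d\n", i, a[i]);   /* serial result: 1 2 4 8 16 32 64 */
        return 0;
    }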

  23. Hardware

  24. Hardware • A faster processor is obviously good, but: • Memory access speed is often a big driver • Cache – a critical element of memory system • Processors have internal parallelism such as pipelines and multiply-add instructions
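As one hedged illustration of the multiply-add point, here is a loop that compilers commonly map onto fused multiply-add instructions; the axpy-style operation is a generic example, not from the slides.

    #include <stdio.h>

    /* y = a*x + y: one multiply and one add per element, a pattern many
       modern processors can execute as a single fused multiply-add. */
    static void axpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        double x[4] = { 1.0, 2.0, 3.0, 4.0 };
        double y[4] = { 0.5, 0.5, 0.5, 0.5 };
        axpy(4, 2.0, x, y);                    /* y becomes 2.5 4.5 6.5 8.5 */
        printf("%f %f %f %f\n", y[0], y[1], y[2], y[3]);
        return 0;
    }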

  25. Cache • Cache is a small chunk of fast memory between the main memory and the registers • Memory hierarchy: registers → primary cache → secondary cache → main memory

  26. Cache (cont’d) • Variables are moved from main memory to cache in lines • L1 cache line sizes on our machines • Opteron (blade cluster) 64 bytes • Power4 (p-series) 128 bytes • PPC440 (Blue Gene) 32 bytes • Pentium III (linux cluster) 32 bytes • If variables are used repeatedly, code will run faster since cache memory is much faster than main memory

  27. Cache (cont’d) • Why not just make the main memory out of the same stuff as cache? • Expensive • Runs hot • This was actually done in Cray computers • Liquid cooling system

  28. Cache (cont’d) • Cache hit • Required variable is in cache • Cache miss • Required variable not in cache • If cache is full, something else must be thrown out (sent back to main memory) to make room • Want to minimize number of cache misses

  29. Cache example • “mini” cache holds 2 lines, 4 words each • for(i=0; i<10; i++) x[i] = i; [Diagram: main memory holds x[0] through x[9] plus two other variables, a and b]

  30. Cache example (cont’d) • We will ignore i for simplicity • need x[0], not in cache → cache miss • load line from memory into cache • next 3 loop indices result in cache hits [Diagram: cache now holds the line x[0] x[1] x[2] x[3]]

  31. Cache example (cont’d) • need x[4], not in cache → cache miss • load line from memory into cache • next 3 loop indices result in cache hits [Diagram: cache now holds the lines x[0] x[1] x[2] x[3] and x[4] x[5] x[6] x[7]]

  32. Cache example (cont’d) • need x[8], not in cache → cache miss • load line from memory into cache • no room in cache! • replace old line [Diagram: cache now holds the lines x[4] x[5] x[6] x[7] and x[8] x[9] a b; the x[0]–x[3] line has been evicted]

  33. Cache (cont’d) • Contiguous access is important • In C, multidimensional array is stored in memory as a[0][0] a[0][1] a[0][2] …

  34. Cache (cont’d) • In Fortran and Matlab, multidimensional array is stored the opposite way: a(1,1) a(2,1) a(3,1) …

  35. Cache (cont’d) • Rule: Always order your loops appropriately

  C:
    for(i=0; i<N; i++){
      for(j=0; j<N; j++){
        a[i][j] = 1.0;
      }
    }

  Fortran:
    do j = 1, n
      do i = 1, n
        a(i,j) = 1.0
      enddo
    enddo
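A hedged sketch contrasting the two traversal orders in C; the array size and the use of clock() for timing are assumptions, but on most systems the contiguous (i-outer) order runs noticeably faster.

    /* Contiguous vs. strided access in C (row-major storage).
       The i-outer/j-inner loop walks memory contiguously; swapping the
       loops strides through memory and misses cache far more often. */
    #include <stdio.h>
    #include <time.h>

    #define N 2000
    static double a[N][N];

    int main(void)
    {
        clock_t t0 = clock();
        for (int i = 0; i < N; i++)        /* good order for C */
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;
        clock_t t1 = clock();
        for (int j = 0; j < N; j++)        /* bad order for C */
            for (int i = 0; i < N; i++)
                a[i][j] = 1.0;
        clock_t t2 = clock();

        printf("contiguous: %.3f s   strided: %.3f s\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }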

  36. SCF Machines

  37. p-series • Shared memory • IBM Power4 processors • 32 KB L1 cache per processor • 1.41 MB L2 cache per pair of processors • 128 MB L3 cache per 8 processors

  38. p-series

  39. Blue Gene • Distributed memory • 2048 processors • 1024 2-processor nodes • IBM PowerPC 440 processors • 700 MHz • 512 MB memory per node (per 2 processors) • 32 KB L1 cache per node • 2 MB L2 cache per node • 4 MB L3 cache per node

  40. BladeCenter • Hybrid memory • 56 processors • 14 4-processor nodes • AMD Opteron processors • 2.6 GHz • 8 GB memory per node (per 4 processors) • Each node has shared memory • 64 KB L1 cache per 2 processors • 1 MB L2 cache per 2 processors

  41. Linux Cluster • Hybrid memory • 104 processors • 52 2-processor nodes • Intel Pentium III processors • 1.3 GHz • 1 GB memory per node (per 2 processors) • Each node has shared memory • 16 KB L1 cache per 2 processors • 512 KB L2 cache per 2 processors

  42. For More Information • SCV web site http://scv.bu.edu/ • Today’s presentations are available at http://scv.bu.edu/documentation/presentations/ under the title “Introduction to Scientific Computing and Visualization”

  43. Next Time • G & T code • Time it • Look at effect of compiler flags • Profile it • Where is time consumed? • Modify it to improve serial performance
