
Parallel coding



Presentation Transcript


  1. Parallel coding Approaches to converting sequential programs to run on parallel machines

  2. Goals • Reduce wall-clock time • Scalability – increase resolution and expand the domain without loss of efficiency. It's all about efficiency; the main threats are poor data communication, poor load balancing and an inherently sequential algorithm.

  3. Efficiency • Communication overhead – data transfer runs at most at ~10^-3 of the processing speed • Load balancing – an uneven load that is only statically balanced may cause idle processor time • Inherently sequential algorithm nature – if all tasks must be performed serially, there is no room for parallelization. Lack of efficiency can make a parallel code perform worse than a similar sequential code.

  4. Scalability Amdahl's Law states that potential program speedup is defined by the fraction of code (f) which can be parallelized

  5. Scalability With P the parallelizable fraction (the f of the previous slide), S = 1 - P the serial fraction, and N the number of processors:

     speedup = 1 / (1 - P)        (upper bound, N -> infinity)
     speedup = 1 / (P/N + S)      (on N processors)

                     speedup
         --------------------------------
       N     P = .50   P = .90   P = .99
     -----   -------   -------   -------
        10      1.82      5.26      9.17
       100      1.98      9.17     50.25
      1000      1.99      9.91     90.99
     10000      1.99      9.91     99.02
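
To make the arithmetic concrete, here is a minimal Fortran sketch (not from the original slides) that reproduces the table above from the second formula:

```fortran
program amdahl
  ! Reproduces the speedup table from Amdahl's law: speedup = 1/(P/N + (1-P))
  implicit none
  integer, parameter :: nlist(4) = (/ 10, 100, 1000, 10000 /)
  real(8), parameter :: plist(3) = (/ 0.50d0, 0.90d0, 0.99d0 /)
  real(8) :: s(3)
  integer :: i, j

  write(*,'(A)') '     N   P=0.50   P=0.90   P=0.99'
  do i = 1, 4
     do j = 1, 3
        s(j) = 1.0d0 / (plist(j)/nlist(i) + (1.0d0 - plist(j)))
     end do
     write(*,'(I6,3F9.2)') nlist(i), s
  end do
end program amdahl
```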

  6. Before we start - Framework • Code may be influenced/determined by machine architecture • The need to understand the architecture • Choose a programming paradigm • Choose the compiler • Determine communication • Choose the network topology • Add code to accomplish task control and communications • Make sure the code is sufficiently optimized (may involve the architecture) • Debug the code • Eliminate lines that impose unnecessary overhead

  7. Before we start - Program • If we are starting with an existing serial program, debug the serial code completely • Identify the parts of the program that can be executed concurrently: • Requires a thorough understanding of the algorithm • Exploit any inherent parallelism which may exist. • May require restructuring of the program and/or algorithm. May require an entirely new algorithm.

  8. Before we start - Framework • Architecture – Intel Xeon, 16 GB distributed memory, Rocks Cluster • Compiler – Intel Fortran / pgf • Network – star (mesh?) • Overhead – make sure the communication channels aren't clogged (net admin) • Optimized code – write C code when necessary, use CPU pipelines, use debugged programs…

  9. Sequential Coding Practice

  10. Improvement methods Sequential coding practice

  11. The COMMON problem Problem: COMMON blocks are copied as one chunk of data each time a process forks. The compiler doesn't distinguish between active COMMONs and redundant ones. Sequential coding practice

  12. The COMMON problem On NUMA (Non-Uniform Memory Access), MPP/SMP (massively parallel processing / symmetric multiprocessor) and vector machines this is rarely an issue; on a distributed computer (cluster) it is crucial (the network gets congested by this)!!! Problem: COMMON blocks are copied as one chunk of data each time a process forks. The compiler doesn't distinguish between declared COMMONs and redundant ones. Sequential coding practice

  13. The COMMON problem • Resolution: • Pass only the required data for the task • Functional programming (pass arguments on the call) • On shared memory architectures use shmXXX commands • On distributed memory architectures use message passing Sequential coding practice
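
As a hedged illustration of the resolution above, the sketch below contrasts a COMMON-based interface with passing only the required data as arguments. The names (FIELDS, T, Q, MICROPHYS, tcol, qcol) are hypothetical, not taken from the actual model:

```fortran
! Before (sketch): every forked process drags the whole COMMON block with it.
!     COMMON /FIELDS/ T(NX,NY,NZ), Q(NX,NY,NZ), WORK(NX,NY,NZ)
!     CALL MICROPHYS(I, J)            ! reads T and Q through the COMMON block

! After (sketch): pass only the data the task actually needs as arguments.
subroutine microphys(tcol, qcol, nz)
  implicit none
  integer, intent(in)    :: nz
  real(8), intent(inout) :: tcol(nz), qcol(nz)   ! one (i,j) column, not the full 3-D field
  ! ... single-column physics goes here ...
end subroutine microphys
```

With this interface, a distributed-memory version only has to ship the two columns (by message passing) instead of the whole block.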

  14. Swapping to secondary storage Problem: swapping is transparent but uncontrolled – the kernel cannot predict which pages are needed next, only determine which are needed frequently. "Swap space is a way to emulate physical RAM, right? No, generally swap space is a repository to hold things from memory when memory is low. Things in swap cannot be addressed directly and need to be paged into physical memory before use, so there's no way swap could be used to emulate memory. So no, 512M + 512M swap is not the same as 1G memory and no swap." (KernelTrap.org) Sequential coding practice

  15. Swapping to secondary storage - Example Setup: 381 MB of data × 2; CPU – dual Intel Pentium 3, 1000 MHz; RAM – 512 MB; Compiler – Intel Fortran; Optimization – O2 (default). Sequential coding practice

  16. Swapping to secondary storage Swap space grows on demand; RAM is fully consumed. 135 s × 4000 kB/s ≈ 520 MB – in each direction! Garbage collection takes time (memory is not freed). For processing 800 MB of data, over 1 GB of data travels at hard-disk rate throughout the run. Sequential coding practice

  17. Swapping to secondary storage Problem (recap): swapping is transparent but uncontrolled – the kernel cannot predict which pages are needed next, only determine which are needed frequently (see the KernelTrap.org quote above). Resolution: prevent swapping by adjusting the amount of data to the user process's RAM size (read and write temporary files from/to disk). Sequential coding practice

  18. Swapping to secondary storage On every node: memory size = 2 GB, predicted number of pending jobs = 3. Use MOSIX for load balancing. Work with data segments no greater than 600 MB per process (open files + memory + output buffers). Sequential coding practice
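
A minimal sketch of the "fit the data to RAM" idea, streaming the field through direct-access scratch files instead of letting it spill to swap. The file names, record layout and the 128 MB chunk size are illustrative assumptions, not the model's actual I/O scheme:

```fortran
program chunked_io
  ! Stream a large dataset in RAM-sized chunks through direct-access files
  ! instead of letting the working set spill into swap.
  implicit none
  integer, parameter :: nchunk = 16*1024*1024   ! 16M doubles = 128 MB per chunk
  real(8), allocatable :: buf(:)
  integer :: irec, ios, lrec

  allocate(buf(nchunk))
  inquire(iolength=lrec) buf                    ! portable record length for buf

  open(10, file='fields_in.dat',  form='unformatted', access='direct', recl=lrec)
  open(11, file='fields_out.dat', form='unformatted', access='direct', recl=lrec)

  irec = 0
  do
     irec = irec + 1
     read(10, rec=irec, iostat=ios) buf         ! bring in only what fits in RAM
     if (ios /= 0) exit                         ! stop at the end of the input file
     buf = buf * 2.0d0                          ! placeholder for the real computation
     write(11, rec=irec) buf
  end do

  close(10)
  close(11)
end program chunked_io
```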

  19. Paging, cache (16 KB pages) Problem: like swapping, memory pages go in and out of the CPU's cache. Again, the compiler cannot predict the ordering of pages into the cache, and this semi-controlled paging leads to performance degradation. Note: on-board memory is slower than cache memory (bus speed) but still faster than disk access. Sequential coding practice

  20. Paging, cache (16 KB pages) Problem (recap): memory pages go in and out of the CPU's cache, and the compiler cannot predict the ordering of pages into the cache. Resolution: prevent cache thrashing by adjusting the data size to the CPU cache. Cache size (Xeon) = 512 KB, so work in 512 KB chunks whenever possible (e.g. 256 × 256 double precision). Sequential coding practice
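
A sketch of that cache-blocking idea under the assumptions above (512 KB L2, double precision). The subroutine name and the placeholder update are illustrative, not model code:

```fortran
subroutine smooth_blocked(a, n)
  ! Process a large 2-D field in cache-sized tiles.  A 256 x 256 tile of
  ! double precision is 256*256*8 bytes = 512 KB, i.e. one L2 cache worth.
  implicit none
  integer, intent(in)    :: n
  real(8), intent(inout) :: a(n, n)
  integer, parameter     :: nb = 256
  integer :: ib, jb, i, j

  do jb = 1, n, nb
     do ib = 1, n, nb
        ! finish the whole tile while it is resident in cache
        do j = jb, min(jb + nb - 1, n)
           do i = ib, min(ib + nb - 1, n)
              a(i, j) = 0.5d0 * a(i, j)   ! placeholder for the real stencil/update
           end do
        end do
     end do
  end do
end subroutine smooth_blocked
```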

  21. Example Data sizes: 381 MB vs. 244 KB; CPU – Intel Pentium 4, 1400 MHz; L2 cache – 256 KB; Compiler – Intel Fortran; Optimization – O2 (default). Sequential coding practice

  22. Example results 2.3 times less code, yet 516 times slower overall and 361 times slower do-loop execution; the profiler attributes the difference to cache misses, function calls and print statements. Sequential coding practice

  23. Workload summary (fastest to slowest) Adjust to cache size → adjust to pages in sequence → adjust to RAM size → control disk activity.

  24. Sparse Arrays • Current– Dense (full) arrays • All array indices are occupied in memory • Matrix manipulations are usually element by element (no linear algebra manipulations when handling parameters on the grid) Sequential coding practice

  25. Dense Arrays in HUCM: Cloud drop size distribution (F1) Number of nonzeros ≈ 110,000, load = 5%; number of nonzeros ≈ 3,700, load = 0.2%. Sequential coding practice

  26. Dense Arrays in HUCM: Cloud drop size distribution (F1), lots of LHOLEs Number of nonzeros ≈ 110,000, load = 14%; number of nonzeros ≈ 3,700, load = 0.5%. Sequential coding practice

  27. Sparse Arrays • Current – dense (full) arrays • All array subscripts occupy memory • Matrix manipulations are usually element by element (no linear-algebra manipulations when handling parameters on the grid) • Improvement – sparse arrays • Only non-zero elements occupy memory cells (sparse notation) • When calculating algebraic matrices – run the profiler to check for performance degradation due to the sparse data Sequential coding practice

  28. Sparse Arrays - HOWTO Store each nonzero as an (I, J, val) triplet (the figure compares the actual stored triplets with the displayed dense array). Sparse storage is a supported datatype in the Intel Math Kernel Library. Sequential coding practice
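
For illustration, a minimal coordinate-format conversion; the routine and array names are hypothetical, and a production code would more likely use the sparse storage formats and routines provided by Intel MKL:

```fortran
subroutine dense_to_coo(f, ni, nkr, ival, jval, val, nnz)
  ! Collect the nonzeros of a mostly-empty distribution array as
  ! (row, column, value) triplets -- coordinate (COO) sparse storage.
  implicit none
  integer, intent(in)  :: ni, nkr
  real(8), intent(in)  :: f(ni, nkr)         ! dense distribution, mostly zeros
  integer, intent(out) :: ival(*), jval(*)   ! row / column index of each nonzero
  real(8), intent(out) :: val(*)             ! the nonzero values themselves
  integer, intent(out) :: nnz
  integer :: i, j

  nnz = 0
  do j = 1, nkr                 ! column-major traversal: j outer, i inner
     do i = 1, ni
        if (f(i, j) /= 0.0d0) then
           nnz = nnz + 1
           ival(nnz) = i
           jval(nnz) = j
           val(nnz)  = f(i, j)
        end if
     end do
  end do
end subroutine dense_to_coo
```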

  29. DO LOOPs • Current – loops are written with no respect to memory layout. Example: FORTRAN uses column-major subscripts (figure: memory layout vs. the virtual layout of a 2-D array, column major). Sequential coding practice

  30. DO LOOPs • The order of the subscripts is crucial • With the wrong order the data pointer advances many steps per iteration, crossing the 16 KB page limit • Many page faults (figure: memory layout vs. the virtual layout of a 2-D array, column major). Sequential coding practice

  31. DO LOOPs • The order of the subscripts is crucial – with the correct order the data pointer advances sequentially through memory (figure: memory layout vs. the virtual layout of a 2-D array, column major). Sequential coding practice

  32. DO LOOPs - example (125 MB array) Sequential coding practice

  33. DO LOOPs 42 times more idle crunching (an order of magnitude); the wall-clock time breaks down into the do-loop, the 'print' statement and a system call. Sequential coding practice

  34. DO LOOPs • Improvements: • Reorder the DO LOOPs, or • Rearrange the dimensions of the array: GFF2R(NI, NKR, NK, ICEMAX) -> GFF2R(ICEMAX, NKR, NK, NI), so that the leftmost dimension becomes the innermost (fastest) running subscript and the rightmost the outermost (slowest)
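
A sketch of the rearranged layout with a matching loop order; only the dimension names follow the slide, and the initialization body is a placeholder:

```fortran
subroutine init_gff2r(gff2r, icemax, nkr, nk, ni)
  ! Fortran is column major: the leftmost subscript varies fastest in memory,
  ! so it should drive the innermost loop.  Dimension order as rearranged above.
  implicit none
  integer, intent(in)  :: icemax, nkr, nk, ni
  real(8), intent(out) :: gff2r(icemax, nkr, nk, ni)
  integer :: ice, kr, k, i

  do i = 1, ni                     ! slowest (rightmost) subscript outermost
     do k = 1, nk
        do kr = 1, nkr
           do ice = 1, icemax      ! fastest (leftmost) subscript innermost
              gff2r(ice, kr, k, i) = 0.0d0
           end do
        end do
     end do
  end do
end subroutine init_gff2r
```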

  35. Parallel Coding Practice

  36. Job Scheduling • Current • Manual batch: hard to track, no monitoring or control • Improvements: • Batch scheduling / parameter sweep (e.g. shell scripts, NIMROD) • EASY/MAUI backfilling job scheduler Parallel coding practice

  37. Load balancing • Current • Administrative – manual (and rough) load balancing: Haim • MPI, PVM, … libraries – no load-balancing capabilities, software dependent: • RAMS – variable grid-point area • MM5, MPP – ? • WRF – ? • File system – NFS. A disaster!!!: client-side caching, no segmented file locks, network congestion • Improvements: • MOSIX: kernel-level governing, better monitoring of jobs, no stray (defunct) residues • MOPI – DFSA (not PVFS, and definitely not NFS) Parallel coding practice

  38. NFS – client-side cache: every node has a non-concurrent mirror of the image • Write – two writes to the same location may crash the system • Read – stale data may be read

  39. Parallel I/O – Local / MOPI (MOSIX Parallel I/O System) (figure: one curve runs at nearly the local bus rate, the other cannot perform better than the network communication rate) Parallel coding practice

  40. Parallel I/O – Local / MOPI • Local – can be adapted with minor change in source code • MOPI - Needs installation but requires no changes in source code

  41. Converting sequential to parallel An easy 5-step method • Hotspot identification • Partition • Communication • Agglomeration • Mapping Parallel coding practice

  42. Parallelizing should be done methodically in a clean, accurate and meticulous way. However intuitive parallel programming may seem, it does not always lend itself to straightforward, automatic, mechanical methods. One of the approaches is the methodical approach (Ian Foster): it maximizes the potential for parallelizing and provides efficient steps that exploit this potential. Furthermore, it provides explicit checklists on completion of each step (not detailed here). Parallel coding practice

  43. 5-step hotspots Identify the hotspots – identify the parts of the program which consume the most run time. Our goal here is to know which code segments can and should be parallelized. Why? For example: greatly improving code that consumes 10% of the run time may increase performance by at most 10%, whereas optimizing code that consumes 90% of the runtime may enable an order-of-magnitude speedup. How? Algorithm inspection (in theory); by looking at the code; by profiling (tools such as prof or another 3rd party) to identify bottlenecks Parallel coding practice
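
When a profiler is not at hand, coarse timing with CPU_TIME can already locate the hotspot. This is a generic sketch; the two stage routines are placeholders, not parts of the actual code:

```fortran
program find_hotspot
  ! Bracket candidate code sections with CPU_TIME to see where the run time goes.
  implicit none
  real(8) :: t0, t1, t2

  call cpu_time(t0)
  call setup_stage()
  call cpu_time(t1)
  call main_loop_stage()
  call cpu_time(t2)

  write(*,'(A,F10.3,A)') 'setup stage: ', t1 - t0, ' s'
  write(*,'(A,F10.3,A)') 'main loop:   ', t2 - t1, ' s'

contains

  subroutine setup_stage()            ! placeholder: cheap initialization
  end subroutine setup_stage

  subroutine main_loop_stage()        ! placeholder: the suspected hotspot
    integer :: i
    real(8) :: s
    s = 0.0d0
    do i = 1, 50000000
       s = s + sqrt(dble(i))
    end do
    if (s < 0.0d0) print *, s         ! keeps the loop from being optimized away
  end subroutine main_loop_stage

end program find_hotspot
```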

  44. 5-step partition1 Definition: The ratio between computation and communication is known as granularity Parallel coding practice

  45. 5-step partition2 • Goal: partition the work into the finest-grained tasks possible • Why? • We want to discover all the available opportunities for parallel execution, and to provide flexibility when we introduce the following steps (communication, memory and other requirements will enforce the optimal agglomeration and mapping) • How? • Functional Parallelism • Data Parallelism • Data decomposition – sometimes it's easier to start off by partitioning the data into segments which are not mutually dependent Parallel coding practice
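
As a sketch of data decomposition, the toy MPI program below assigns each rank a contiguous slab of grid columns; NI and the slab layout are illustrative assumptions, not the model's actual decomposition:

```fortran
program decompose_grid
  ! 1-D data decomposition: each MPI rank owns a contiguous slab of grid columns.
  use mpi
  implicit none
  integer, parameter :: ni = 1000
  integer :: rank, nprocs, ierr, chunk, i1, i2

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  chunk = (ni + nprocs - 1) / nprocs          ! columns per rank, rounded up
  i1 = rank * chunk + 1
  i2 = min(ni, (rank + 1) * chunk)

  if (i1 <= i2) then
     write(*,'(A,I4,A,I6,A,I6)') 'rank ', rank, ' owns columns ', i1, ' to ', i2
  end if

  call MPI_Finalize(ierr)
end program decompose_grid
```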

  46. 5-step partition3 Parallel coding practice

  47. 5-step partition4 • Goal: partition the work into the finest-grained tasks possible • Why? • We want to discover all the available opportunities for parallel execution, and to provide flexibility when we introduce the following steps (communication, memory and other requirements will enforce the optimal agglomeration and mapping) • How? • Functional Parallelism • Data Parallelism • Functional decomposition – partitioning the calculation into segments which are not mutually dependent (e.g. integration components are evaluated before the integration step) Parallel coding practice

  48. 5-step partition5 Parallel coding practice

  49. 5-step communication1 • Communication occurs during data passing and synchronization. We strive to minimize data communication between tasks or to make the messages more coarse-grained • Sometimes the master process may encounter too much incoming traffic: if large data chunks must be transferred, try to form hierarchies when aggregating the data • The most efficient granularity depends on the algorithm and the hardware environment in which it runs • Decomposing the data has a crucial role here; consider revisiting step 2 Parallel coding practice

  50. 5-step communication2 Sending data out to sub-tasks: • Point-to-point is best for sending personalized data to each independent task • Broadcast is a good way to clog the network (all processors update the data, then need to send it back to the master), but it is useful when a large computation can be performed once and lookup tables can be sent across the network • Collection (reduction) is usually used to perform mathematics like min, max, sum… • Shared memory systems synchronize using memory-locking techniques • Distributed memory systems may use blocking or non-blocking message passing; blocking MP may be used for synchronization Parallel coding practice
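
A minimal MPI sketch of the patterns above – point-to-point distribution of personalized chunks, a blocking receive as a simple synchronization point, and a collective reduction. Sizes and values are toy placeholders, not taken from the model:

```fortran
program comm_patterns
  ! Point-to-point distribution, blocking receive as synchronization,
  ! and a collective reduction for a global sum.
  use mpi
  implicit none
  integer, parameter :: n = 1000
  real(8) :: work(n), local_sum, global_sum
  integer :: rank, nprocs, ierr, dest
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  if (rank == 0) then
     ! master sends a personalized chunk to each worker (point to point)
     do dest = 1, nprocs - 1
        work = dble(dest)
        call MPI_Send(work, n, MPI_DOUBLE_PRECISION, dest, 1, MPI_COMM_WORLD, ierr)
     end do
     work = 0.0d0                      ! the master's own contribution
  else
     ! blocking receive: also acts as a synchronization point
     call MPI_Recv(work, n, MPI_DOUBLE_PRECISION, 0, 1, MPI_COMM_WORLD, status, ierr)
  end if

  ! collective reduction: every rank contributes, rank 0 collects the result
  local_sum = sum(work)
  call MPI_Reduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)
  if (rank == 0) write(*,*) 'global sum = ', global_sum

  call MPI_Finalize(ierr)
end program comm_patterns
```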
