CS 420 - Design of Algorithms


Presentation Transcript


  1. CS 420 - Design of Algorithms: Parallel Computer Architecture and Software Models

  2. Parallel Computing – it's about performance • Greater performance is the reason for parallel computing • Many types of scientific and engineering programs are too large and too complex for traditional uniprocessors • Such large problems are common in – • ocean modeling, weather modeling, astrophysics, solid state physics, power systems, CFD…

  3. FLOPS – a measure of performance • FLOPS – Floating Point Operations per Second • … a measure of how much computation can be done in a certain amount of time • MegaFLOPS – MFLOPS – 10^6 FLOPS • GigaFLOPS – GFLOPS – 10^9 FLOPS • TeraFLOPS – TFLOPS – 10^12 FLOPS • PetaFLOPS – PFLOPS – 10^15 FLOPS

  4. How fast … • Cray 1 – ~150 MFLOPS • Pentium 4 – 3-6 GFLOPS • IBM's BlueGene – 360+ TFLOPS • PSC's Big Ben – 10 TFLOPS • Humans – it depends • as calculators – 0.001 MFLOPS • as information processors – 10 PFLOPS

  5. FLOPS vs. MIPS • FLOPS is only concerned with floating point calculations • Other performance issues: • memory latency • cache performance • I/O capacity • interconnect

  6. See… • www.Top500.org • biannual performance reports and … • rankings of the fastest computers in the world

  7. Performance • Speedup(n processors) = time(1 processor) / time(n processors) • ** Culler, Singh and Gupta, Parallel Computer Architecture: A Hardware/Software Approach
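
The ratio above is easy to compute from measured wall-clock times. Below is a minimal C sketch (not part of the original slides; the timings and processor count are made-up placeholders) that turns two timings into speedup and parallel efficiency:

    #include <stdio.h>

    int main(void) {
        double t1 = 120.0;  /* hypothetical run time on 1 processor, in seconds  */
        double tn = 18.0;   /* hypothetical run time on n processors, in seconds */
        int    n  = 8;      /* number of processors used for the parallel run    */

        double speedup    = t1 / tn;      /* speedup(n) = time(1) / time(n)      */
        double efficiency = speedup / n;  /* fraction of ideal (linear) speedup  */

        printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
        return 0;
    }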

  8. Consider… a map of the Indian Ocean, from: www.lib.utexas.edu/maps/indian_ocean.html

  9. … a model of the Indian Ocean – • 73,000,000 square kilometers of surface • One data point every 100 meters • 7,300,000,000 surface points • Need to model the ocean at depth – say every 10 meters down to 200 meters • 20 depth data points • Every 10 minutes for 4 hours – • 24 time steps

  10. So – • 73 x 10^6 (sq. km of surface) x 10^2 (points per sq. km) x 20 (depth points) x 24 (time steps) • 3,504,000,000,000 data points in the model grid • Suppose calculations of 100 instructions per grid point • 350,400,000,000,000 instructions in the model

  11. Then – • Imagine that you have a computer that can run 1 billion (10^9) instructions per second • 3.504 x 10^14 / 10^9 = 350,400 seconds • or about 97 hours – roughly 4 days

  12. But – • On a 10 teraflops computer – • 3.504 x 10^14 / 10^13 = 35.0 seconds
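
For reference, the arithmetic on slides 9-12 can be reproduced with a short C program; this is just a sketch of the slides' back-of-the-envelope estimate, using the same assumed 100 instructions per grid point:

    #include <stdio.h>

    int main(void) {
        double surface_km2     = 73.0e6;  /* area of the Indian Ocean, sq. km          */
        double points_per_km2  = 100.0;   /* one point per 100 m => 10 x 10 per sq. km */
        double depth_points    = 20.0;    /* every 10 m down to 200 m                  */
        double time_steps      = 24.0;    /* every 10 minutes for 4 hours              */
        double instr_per_point = 100.0;   /* assumed instructions per grid point       */

        double grid_points  = surface_km2 * points_per_km2 * depth_points * time_steps;
        double instructions = grid_points * instr_per_point;

        printf("grid points:  %.3e\n", grid_points);               /* ~3.504e12  */
        printf("instructions: %.3e\n", instructions);              /* ~3.504e14  */
        printf("at 10^9  instr/s: %.0f s\n", instructions / 1e9);  /* ~350,400 s */
        printf("at 10^13 instr/s: %.1f s\n", instructions / 1e13); /* ~35 s      */
        return 0;
    }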

  13. Gaining performance • Pipelining • More instructions – faster • More instructions in execution at the same time in a single processor • Not usually an attractive strategy these days – why?

  14. Instruction Level Parallelism (ILP) • based on the fact that many instructions do not depend on the instructions before them… • the processor has extra hardware to execute several instructions at the same time • …multiple adders…
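
A small illustration of the idea (not from the slides): in the C function below the first three sums have no data dependences on one another, so a processor with several adders can execute them simultaneously, while the final sum must wait for all three.

    int independent_sums(int x, int y, int u, int v, int p, int q) {
        int a = x + y;      /* independent of b and c             */
        int b = u + v;      /* independent of a and c             */
        int c = p + q;      /* independent of a and b             */
        return a + b + c;   /* depends on a, b, and c - must wait */
    }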

  15. Pipelining and ILP are not the solution to our problem – why? • only incremental improvements in performance • they have been done already • we need orders-of-magnitude improvements in performance

  16. Gaining Performance: Vector Processors • Scientific and engineering computations are often vector and matrix operations • graphic transformations – e.g. shift object x to the right • Redundant arithmetic hardware and vector registers operate on an entire vector in one step (SIMD)

  17. Gaining Performance • Vector Processors • Declining popularity for a while – • Hardware expensive • Popularity returning – • Applications – science, engineering, cryptography, media/graphics • Earth Simulator • your computer?
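
As a concrete illustration of what a vector unit speeds up, consider the classic SAXPY loop below (a hedged sketch, not taken from the slides): the same multiply-add is applied to every element, so a vector processor, or a compiler auto-vectorizing for SIMD units, can process many elements per instruction.

    /* y[i] = a*x[i] + y[i] for every element: a textbook vectorizable loop */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++) {
            y[i] = a * x[i] + y[i];   /* identical operation on each element */
        }
    }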

  18. Parallel Computer Architecture • Shared Memory Architectures • Distributed Memory

  19. Shared Memory Systems • Multiple processors connected to, and sharing, the same pool of memory • SMP – symmetric multiprocessing • Every processor has, potentially, access to and control of every memory location

  20. Shared Memory Computers • [diagram: several processors all connected directly to one shared memory]

  21. Shared Memory Computers • [diagram: processors connected to several shared memory modules]

  22. Shared Memory Computers • [diagram: processors connected to several memory modules through a switch]

  23. Shared Memory Computers • SGI Origin2000 – at NCSA • Balder • 256 250 MHz R10000 processors • 128 GB of memory

  24. Shared Memory Computers • Rachel at PSC • 64 1.15 GHz EV7 processors • 256 GB of shared memory

  25. Distributed Memory Systems • Multiple processors, each with its own memory • Interconnected to share/exchange data and processing • The modern architectural approach to supercomputers • Supercomputers and clusters are similar • **Hybrid distributed/shared memory

  26. Clusters – distributed memory • [diagram: nodes, each with its own processor and memory, connected by an interconnect]

  27. Cluster – distributed memory with SMP nodes • [diagram: nodes, each an SMP with two processors (Proc1, Proc2) sharing local memory, connected by an interconnect]

  28. Distributed Memory Supercomputer • BlueGene/L • DOE/IBM • 0.7 GHz PowerPC 440 • 131,072 processors (previously 32,768) • 367 TFLOPS (was 70 TFLOPS)

  29. Distributed Memory Supercomputer • Thunder at LLNL • Number 19 (was Number 5) • 20 TFLOPS • 1.4 GHz Itanium processors • 4096 processors

  30. Earth Simulator • Japan • Built by NEC • Number 14 (was Number 1) • 40 TFLOPS • 640 nodes • each node = 8 vector processors • 640x640 full crossbar

  31. Grid Computing Systems • What is a grid? • It means different things to different people • Distributed processors • around campus • around the state • around the world

  32. Grid Computing Systems • Widely distributed • Loosely connected (i.e. Internet) • No central management

  33. Grid Computing Systems • Connected clusters and other dedicated scientific computers • [diagram: sites linked over I2/Abilene]

  34. Grid Computing Systems • Harvested idle cycles • [diagram: desktop machines reached over the Internet, coordinated by a control/scheduler node]

  35. Grid Computing Systems • Dedicated Grids • TeraGrid • Sabre • NASA Information Power Grid • Cycle Harvesting Grids • Condor • *GlobalGridExchange (Parabon) • Seti@home http://setiathome.berkeley.edu/ • Einstein@home http://einstein.phys.uwm.edu/

  36. Flynn's Taxonomy – classifies architectures by instruction streams and data streams: SISD, SIMD, MISD, MIMD • *Single Program/Multiple Data – SPMD

  37. SISD – Single Instruction Single Data • Single instruction stream – “single instruction execution per clock cycle” • Single data stream – one piece of data per clock cycle • Deterministic • Traditional CPU, most single-CPU PCs • Example instruction stream: Load x to A; Load y to B; Add B to A; Store A; Load x to A; …

  38. Single Instruction Multiple Data • One instruction stream • Multiple data streams (partitions) • A given instruction operates on multiple data elements • Lockstep • Deterministic • Processor arrays, vector processors • CM-2, Cray C90 • Example (PE-1, PE-2, … PE-n execute the same instruction on their own data): PE-1: Load A(1); Load B(1); C(1)=A(1)*B(1); Store C(1) • PE-2: Load A(2); Load B(2); C(2)=A(2)*B(2); Store C(2) • PE-n: Load A(3); Load B(3); C(3)=A(3)*B(3); Store C(3)

  39. Multiple Instruction Single Data • Multiple instruction streams • Operate on a single data stream • Several instructions operate on the same data element – concurrently • A bit strange – CMU • Multi-pass filters • Encryption – code cracking • Example (PE-1, PE-2, … PE-n work on the same data element A(1)): PE-1: Load A(1); Load B(1); C(1)=A(1)*4; Store C(1) • PE-2: Load A(1); Load B(2); C(2)=A(1)*4; Store C(2) • PE-n: Load A(1); Load B(3); C(3)=A(1)*4; Store C(3)

  40. Multiple Instruction Multiple Data • Multiple instruction streams • Multiple data streams • Each processor has its own instructions and its own data • Most supercomputers, clusters, grids • Example (each PE runs unrelated work): PE-1: Load A(1); Load B(1); C(1)=A(1)*4; Store C(1) • PE-2: Load G; A=SQRT(G); C=A*Pi; Store C • PE-n: Load B; Call func1(B,C); Call func2(C,G); Store G

  41. Single Program Multiple Data • Single code image/executable • Each processor has its own data • Instruction execution under program control • DMC, SMP • Example (every PE runs the same program, branching on its PE id): PE-1: Load A; Load B; if PE=1 then…; C=A*B; Store C • PE-2: Load A; Load B; if PE=2 then…; C=A*B; Store C • PE-n: Load A; Load B; if PE=n then…; C=A*B; Store C
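
The SPMD pattern on this slide is essentially how MPI programs are written. A minimal, hedged C sketch (the printed messages are illustrative only): every process runs the same executable and branches on its rank, which plays the role of the “if PE=1 then…” above.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which PE am I?      */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many PEs total? */

        if (rank == 0) {
            printf("PE 0 of %d: doing the coordinator's work\n", size);
        } else {
            printf("PE %d of %d: doing a worker's share\n", rank, size);
        }

        MPI_Finalize();
        return 0;
    }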

  42. Multiple Program Multiple Data • MPMD is like SPMD … • …except each processor runs a separate, independent executable • How to implement interprocess communication? • Sockets • MPI-2 – more later • [diagram: SPMD – ProgA on every PE; MPMD – ProgA, ProgB, ProgC, ProgD on different PEs]

  43. UMA and NUMA • UMA – Uniform Memory Access • all processors have equal access to memory • Usually found in SMPs • Identical processors • Difficult to scale as the number of processors increases • Good processor-to-memory bandwidth • Cache coherency (CC) – • important • can be implemented in hardware

  44. UMA and NUMA • NUMA – Non-Uniform Memory Access • Access to memory differs by processor • local processor = good access, nonlocal processors = not so good access • Usually multiple computers or multiple SMPs • Memory access across the interconnect is slow • Cache coherency (CC) – • can be done • usually not a problem

  45. Let’s revisit speedup… • we can (theoretically) achieve speedup by using more processors,… • but a number of factors may limit speedup… • interprocessor communication • interprocess synchronization • load balance • parallelizability of algorithms

  46. Amdahl’s Law • According to Amdahl’s Law… • Speedup = 1 / (S + (1-S)/N) • where • S is the fraction of the program that is purely sequential • N is the number of processors

  47. Amdahl’s Law • What does it mean – • part of a program is parallelizable • part of the program must be sequential (S) • Amdahl’s law says – • speedup is constrained by the portion of the program that must remain sequential, relative to the part that is parallelized • Note: if S is very small, the problem is sometimes called “embarrassingly parallel”
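
A quick worked example of the formula (assuming, for illustration, a sequential fraction S = 0.1) shows how the sequential part caps the achievable speedup:

    #include <stdio.h>

    int main(void) {
        double S = 0.1;                      /* assumed sequential fraction (10%) */
        int    counts[] = {2, 8, 64, 1024};  /* processor counts to try           */

        for (int i = 0; i < 4; i++) {
            int    N       = counts[i];
            double speedup = 1.0 / (S + (1.0 - S) / N);   /* Amdahl's Law */
            printf("N = %4d  speedup = %5.2f\n", N, speedup);
        }
        /* As N grows, speedup approaches 1/S = 10: the sequential 10%
           limits the benefit no matter how many processors are added. */
        return 0;
    }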

  48. Software models for parallel computing • Sockets and other P2P models • Threads • Shared Memory • Message Passing • Data Parallel

  49. Sockets and others • TCP sockets • establish TCP links among processes • send messages through the sockets • RPC, CORBA, DCOM • Web services, SOAP…
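
To make the socket model concrete, here is a hedged C sketch of one process opening a TCP connection to another and sending a small message; the host (127.0.0.1), port (5000), and message text are placeholders, and error handling is minimal.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);         /* create a TCP socket */
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in peer;
        memset(&peer, 0, sizeof peer);
        peer.sin_family = AF_INET;
        peer.sin_port   = htons(5000);                    /* placeholder port    */
        inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr);  /* placeholder host    */

        if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
            perror("connect");
            close(fd);
            return 1;
        }

        const char *msg = "hello from process A";
        send(fd, msg, strlen(msg), 0);   /* message passing, done by hand */
        close(fd);
        return 0;
    }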

  50. Threads • A single executable runs… • …at specific points in execution it launches new threads of execution… • …threads can be launched on other PEs • …when the threads finish, control returns to the main program • …fork and join • POSIX threads, Microsoft threads • OpenMP is implemented with threads • [diagram: the main thread forks a team of threads t0–t3, joins, and later forks another team t0–t3]
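
Since the slide notes that OpenMP is implemented with threads, here is a minimal fork/join sketch in C with OpenMP (illustrative only; compile with something like gcc -fopenmp): the program runs serially, forks a team of threads for the parallel region, then joins back to a single thread.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        printf("serial part: one thread\n");

        #pragma omp parallel              /* fork: a team of threads t0..tN */
        {
            int id = omp_get_thread_num();
            printf("thread %d working\n", id);
        }                                 /* join: back to a single thread  */

        printf("serial part again: one thread\n");
        return 0;
    }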
