
Higher Level Parallelism

Higher Level Parallelism: The PRAM Model • Vector Processors • Flynn Classification • Connection Machine CM-2 (SIMD) • Communication Networks • Memory Architectures • Synchronization • Amdahl’s Law


Presentation Transcript


  1. Higher Level Parallelism • The PRAM Model • Vector Processors • Flynn Classification • Connection Machine CM-2 (SIMD) • Communication Networks • Memory Architectures • Synchronization

  2. Amdahl’s Law • The performance gain from speeding up some operations is limited by the fraction of the time those (faster) operations are actually used • Speedup = Original T / Improved T • Speedup = Improved Performance / Original Performance
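A minimal sketch of Amdahl’s Law as code (my own illustration, not from the slides; the function name and example values are assumptions): f is the fraction of execution time that benefits from the improvement, s the speedup of that fraction.

  #include <stdio.h>

  /* Amdahl's Law: overall speedup when a fraction f of the
     execution time is sped up by a factor s.               */
  double amdahl_speedup(double f, double s) {
      return 1.0 / ((1.0 - f) + f / s);
  }

  int main(void) {
      /* Example: speeding up 80% of a program by 10x
         gives only about a 3.57x overall speedup.     */
      printf("%.2f\n", amdahl_speedup(0.8, 10.0));
      return 0;
  }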

  3. PRAM MODEL • All processors share the same memory space • CRCW • concurrent read, concurrent write • resolution function on collision (first/or/largest/error) • CREW • concurrent read, exclusive write • EREW • exclusive read, exclusive write

  4. PRAM Algorithm • Same program/algorithm in all processors • Each processor also has local memory/registers • Example: search for one value in an array • Using p processors • Array size m • p = m • Search for the value 2 in the array: 3 2 5 7 2 5 1 6

  5. Search CRCW p=m • Step 1: concurrent read. The same memory address is accessed by all processors, so every processor P1..P8 gets the search key A = 2. • Step 2: read B. Each processor reads a different memory address, its own array element, so B = 3 2 5 7 2 5 1 6 across P1..P8.

  6. Search CRCW p=m • Step 3: concurrent write. Each processor writes 1 if A = B, else 0; with A = 2 in every processor and B = 3 2 5 7 2 5 1 6, we use the “or” resolution and the result is 1 (1: value found, 0: value not found). • Complexity • All operations performed in constant time • Count only the cost of communication steps • In this case the number of steps is independent of m (if enough processors) • Search is done in constant time, O(1), for CRCW and p = m
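As a rough illustration (my own sketch, not from the slides), the CRCW search with “or” write resolution can be mimicked on a shared-memory machine with OpenMP: every thread that finds the key writes the same value 1 to the shared flag, so the colliding writes agree, just like the “or” resolution above.

  #include <stdio.h>
  #define M 8

  int main(void) {
      int B[M] = {3, 2, 5, 7, 2, 5, 1, 6};
      int A = 2;       /* the key, concurrently read by all "processors" */
      int found = 0;   /* concurrently written; all writers store 1      */

      #pragma omp parallel for   /* one iteration per "processor" */
      for (int i = 0; i < M; i++)
          if (B[i] == A)
              found = 1;         /* "or" resolution: colliding writes agree */

      printf(found ? "Value found\n" : "Value not found\n");
      return 0;
  }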

  7. Search CREW p=m • Step 3: each processor computes 1 if A = B, else 0, giving 0 1 0 0 1 0 0 0 for P1..P8. The same processors can be reused in the next step! • Step 4.1: read A; Step 4.2: read B; Step 4.3: compute A or B. Each round halves the number of active processors (P1..P4, then P1..P2, then P1), so log m steps are needed to “collect” the result; see the sketch after this slide. • Complexity • We need log m steps to collect the result • Operations done in constant time • O(log m) complexity
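A sketch of the “collect” phase (my own sequential simulation, not code from the slides): flags[i] holds processor i's compare result, and each of the log2(m) steps OR-combines flags[i] with flags[i + half], which a PRAM would do in parallel.

  /* After log2(m) halving steps, flags[0] is the OR of all flags. */
  int collect_or(int flags[], int m) {
      for (int half = m / 2; half >= 1; half /= 2)   /* log2(m) steps         */
          for (int i = 0; i < half; i++)             /* in parallel on a PRAM */
              flags[i] |= flags[i + half];
      return flags[0];   /* 1: value found, 0: value not found */
  }

For the example above, collect_or((int[]){0,1,0,0,1,0,0,0}, 8) returns 1.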

  8. Search EREW p=m • Exclusive read means all processors cannot read the key at once; instead the value is doubled each step (P1, then P1..P2, then P1..P4, then P1..P8), so it takes log m steps to distribute it. More complex? NO, the algorithm is still O(log m); only the constant differs.

  9. PRAM a Theoretical Model • CRCW • Very elegant • Not of much practical use (too hard to implement) • CREW • This model can be used to develop algorithms for parallel computers, e.g. our search example • p = 1 (a single processor): checking all elements gives O(m) • p = m (m processors): complexity O(log m), not O(1) • From our example we conclude that even in theory we do not get an m-times “speedup” using m processors: THAT IS ONE BIG PROBLEM WITH PARALLEL COMPUTERS

  10. Parallelism so far • By pipelining, several instructions (at different stages) are executed simultaneously • Pipeline depth limited by hazards • Superscalar designs provide parallel execution units • Limited by instruction and machine level parallelism • VLIW might improve over hardware instruction issuing • All limited by the instruction fetch mechanism • Called the FLYNN BOTTLENECK • Only a very limited number of instructions can be fetched each cycle • That makes vector operations ineffective

  11. Vector Processors • Taking pipelining to its limits for vector operations • Sometimes referred to as a SuperPipeline • The same operation is performed on a vector of data • No data dependencies within the vector data • Example: add two vectors • Solves the FLYNN BOTTLENECK problem • A loop over a vector can be issued by a single instruction • Proven to be very effective for scientific calculations • CRAY-1, CRAY-2, CRAY-XMP, CRAY-YMP

  12. Vector Processor (CRAY-1 like) • [Diagram: main memory feeds a vector load/store unit, a set of vector registers, and scalar registers (like the MIPS register file); superpipelined arithmetic units for FP add/subtract, FP multiply, FP divide, integer and logical operations work on the registers.]

  13. Vector Operations • Fully pipelined • CPI = 1: we produce one result each cycle once the pipe is full • Pipeline latency • Startup cost = pipeline depth • Vector add 6 cycles • Vector multiply 6 cycles • Vector divide 20 cycles • Vector load 12 cycles (depends on the memory hierarchy) • Sustained rate • Time/element for a collection of related vector operations

  14. Vector Processor Design • Vector length control • VLR register (Maximum Vector Length, MVL) • Strip mining in software (a vector longer than MVL causes a loop; see the sketch after this slide) • Stride • How to lay out vectors and matrices in memory so that memory banks can be accessed without collision • Vector chaining • Forwarding between vector registers (minimizes latency) • Vector mask register (Boolean valued) • Conditional writeback (if 0, no writeback) • Sparse matrices and conditional execution
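A sketch of strip mining in C (my own illustration, not from the slides; the MVL value is assumed and the inner loop stands in for a single vector instruction): a vector longer than MVL is processed as one odd-sized strip followed by full MVL-sized strips.

  #define MVL 64   /* maximum vector length of the machine (assumed) */

  /* Strip-mined vector add, C = A + B for n elements. Each pass
     through the outer loop corresponds to one vector instruction
     operating on at most MVL elements.                            */
  void strip_mined_add(double *A, double *B, double *C, int n) {
      int low = 0;
      int len = n % MVL;            /* odd-sized first strip */
      if (len == 0) len = MVL;
      for (; low < n; low += len, len = MVL)
          for (int i = low; i < low + len; i++)   /* one vector op of length len */
              C[i] = A[i] + B[i];
  }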

  15. Programming • By use of language constructs, the compiler is able to utilize the vector functions • FORTRAN is widely used for scientific calculations • built-in matrix and vector functions/commands • LINPACK • A library of optimized linear algebra functions • Often used as a benchmark (but does it tell the whole truth?) • Some more (implicit) vectorization is possible with advanced compilers

  16. Flynn Classification • SISD (Single Instruction, Single Data) • The MIPS, and even the vector processor • SIMD (Single Instruction, Multiple Data) • Each instruction activates several execution units in parallel • MISD (Multiple Instruction, Single Data) • The VLIW architecture might be considered, but… MISD is a seldom-used classification • MIMD (Multiple Instruction, Multiple Data) • Multiprocessor architectures • Multicomputers (communicating over a LAN), sometimes treated as a separate class of architectures

  17. Communication Networks • Bus • Total Bandwidth = Link Bandwidth • Bisection Bandwidth = Link Bandwidth • Ring • Total Bandwidth = P * Link Bandwidth • Bisection Bandwidth = 2 * Link Bandwidth • Fully Connected • Total Bandwidth = (P * (P - 1))/2 * Link Bandwidth • Bisection Bandwidth = (P/2)^2 * Link Bandwidth
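A small sketch (mine, not from the slides) that evaluates the three formulas above for example values, P = 8 processors and a link bandwidth of 1:

  #include <stdio.h>

  int main(void) {
      double P = 8, link = 1.0;   /* assumed example values */
      printf("Bus:             total = %g, bisection = %g\n", link, link);
      printf("Ring:            total = %g, bisection = %g\n", P * link, 2 * link);
      printf("Fully connected: total = %g, bisection = %g\n",
             P * (P - 1) / 2 * link,      /* P(P-1)/2 links           */
             (P / 2) * (P / 2) * link);   /* (P/2)^2 links in the cut */
      return 0;
  }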

  18. MultiStage Networks • Crossbar Switch: any one-to-one connection pattern can be routed at the same time, e.g. P1 to P2 (or P3), P2 to P4, P3 to P1 • Omega Network: log2(P) switch stages, so it is cheaper than a crossbar, but blocking: e.g. P1 to P6 works, but P2 to P8 is not possible at the same time • [Diagram: an 8-processor Omega network and crossbar switch, P1..P8]

  19. Connection Machines CM-2 (SIMD) • 16 fully connected 1-bit CPUs on each chip • Each CPU has 3 1-bit registers and 64 Kbit of memory • The chips are connected as a hypercube (a 3-cube is shown as an example); the CM-2 uses a 12-cube for communication between the chips • Four sections, each with 16k 1-bit CPUs (1024 chips) and 512 FPAs (floating-point accelerators) • A front end (SISD) feeds a sequencer that broadcasts instructions to all CPUs • Data Vault (disk array) for mass storage

  20. SIMD Programming, Parallel sum

  sum = 0;
  for (i = 0; i < 65536; i = i + 1)   /* loop over 64k elements */
      sum = sum + A[Pn,i];            /* Pn is the processor number */

  limit = 8192; half = limit;         /* collect sum from 8192 processors */
  repeat
      half = half/2;                  /* split into senders/receivers */
      if (Pn >= half && Pn < limit) send(Pn - half, sum);
      if (Pn < half) sum = sum + receive();
      limit = half;
  until (half == 1)                   /* final sum in processor 0 */

  [Trace for 4 processors: step 1 (half = 2): P2 executes send(0, sum) and P3 executes send(1, sum), while P0 and P1 each execute sum = sum + receive(); step 2 (half = 1): P1 executes send(0, sum) and P0 executes sum = sum + receive(), leaving the final sum in P0.]

  21. SIMD vs MIMD • SIMD • Single Instruction (one PC) • All processors perform the same work (synchronized) • Conditional execution (case/if etc.): each processor holds an enable bit • MIMD • Each processor has its own PC • Possible to run different programs, BUT • all may run the same program (SPMD, Single Program Multiple Data) • Use MIMD style programming for conditional execution • Use SIMD style programming for synchronized actions

  22. Memory Architectures for MIMD • Centralized • We use a single bus for all main memory • Uniform memory access (after passing the local cache) • Distributed • The sought address might be hosted by another processor • Non-uniform memory access (dynamic “find” time) • The extreme: a cache-only memory • Shared • All processors share the same address space • Memory can be used for communication • Private • Each processor has a unique address space • Communication must be done by “message passing”

  23. Shared Bus MIMD • [Diagram: usually 2-32 processors, each with a cache and a snoop tag, connected by a shared bus to memory and I/O.] • Cache Coherency Protocol • Write Invalidate • The first write to address A causes all other cached copies of A to be invalidated • Write Update • On a write to address A, all cached copies of A are updated (high bus activity) • On a cache read miss when using write-back caches, either • the cache holding the valid data writes it back to memory, or • the cache holding the valid data writes it directly to the cache requesting the data
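As a rough sketch of write invalidate (my own illustration, not the slides’ code; the state names and the bus primitive are assumptions), each cache snoops the bus and drops its copy when it sees another processor write the same address:

  typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

  /* Run in every other cache when a write to snooped_addr is seen on the bus. */
  void snoop_write(line_state_t *state, int snooped_addr, int my_addr) {
      if (snooped_addr == my_addr && *state != INVALID)
          *state = INVALID;   /* write invalidate: drop the stale copy */
  }

  /* Run in the writing cache: announce the write before going to MODIFIED. */
  void local_write(line_state_t *state) {
      if (*state != MODIFIED) {
          /* bus_broadcast_invalidate(addr);  -- hypothetical bus primitive */
          *state = MODIFIED;
      }
  }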

  24. Synchronization • When using shared data we must ensure that only one processor at a time can access the data while updating it • We need an atomic TEST&SET operation

  Processor 1 and Processor 2 both run:

  loop: TEST&SET A.lock
        beq A.go loop
        update A
        clear A.lock

  Processor 1 gets the lock (A.go), updates the shared data, and finally clears the lock (A.lock). Processor 2 spin-waits until the lock is released, then updates the shared data and releases the lock.
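A minimal sketch of the same spin lock using C11 atomics (my translation of the slide’s TEST&SET loop, not code from the slides): atomic_flag_test_and_set atomically sets the flag and returns its old value, so the loop spins until the lock was free.

  #include <stdatomic.h>

  atomic_flag A_lock = ATOMIC_FLAG_INIT;   /* the shared lock, A.lock */

  void update_shared_data(void) {
      /* TEST&SET: atomically set the flag and get its old value;
         spin while the old value was 1 (lock already taken).     */
      while (atomic_flag_test_and_set(&A_lock))
          ;                                /* spin-wait             */
      /* ... update the shared data A ... */
      atomic_flag_clear(&A_lock);          /* clear A.lock: release */
  }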
