Higher Level Parallelism
• The PRAM Model
• Vector Processors
• Flynn Classification
• Connection Machine CM-2 (SIMD)
• Communication Networks
• Memory Architectures
• Synchronization
Amdahl’s Law
• The performance gain from speeding up some operations is limited by the fraction of the time those (faster) operations are used
• Speedup = Original T / Improved T
• Speedup = Improved Performance / Original Performance
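A minimal C sketch of the law (the numbers are assumed for illustration, not taken from the slides): if 80% of the running time is sped up 4 times, the overall speedup is only 2.5, not 4.

    #include <stdio.h>

    /* Amdahl's Law: overall speedup when a fraction f of the original
       execution time is improved by a factor s. */
    double speedup(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        /* assumed example: 80% of the time runs 4x faster */
        printf("speedup = %.2f\n", speedup(0.8, 4.0));   /* prints 2.50 */
        return 0;
    }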
PRAM MODEL
• All processors share the same memory space
• CRCW
  • concurrent read, concurrent write
  • resolution function on collision (first/or/largest/error)
• CREW
  • concurrent read, exclusive write
• EREW
  • exclusive read, exclusive write
PRAM Algorithm
• Same Program/Algorithm in All Processors
• Each Processor also has local memory/registers
• Ex: search for one value in an array
  • Using p processors
  • Array size m
  • p = m
• Search for the value 2 in the array: 3 2 5 7 2 5 1 6
Search CRCW p = m
• step 1: concurrent read. All processors (P1…P8) read the search value 2 from the same memory address into A
• step 2: read B. Each processor reads a different memory address (its own array element) into B
  A: 2 2 2 2 2 2 2 2
  B: 3 2 5 7 2 5 1 6
Search CRCW p = m
• step 3: concurrent write. Each processor writes 1 if A = B, else 0; we use the “or” resolution
  • 1: Value found
  • 0: Value not found
• Complexity
  • All operations performed in constant time
  • Count only the cost of communication steps
  • In this case the number of steps is independent of m (if there are enough processors)
  • Search is done in constant time, O(1), for CRCW and p = m
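A minimal sequential C sketch simulating the three CRCW steps with the “or” resolution (the loop body plays the role of one of the p = m processors; all naming is ours):

    #include <stdio.h>

    int crcw_search(const int *array, int m, int key) {
        int result = 0;             /* shared result cell, initially 0 */
        for (int pn = 0; pn < m; pn++) {
            int a = key;            /* step 1: concurrent read of key  */
            int b = array[pn];      /* step 2: read own element        */
            if (a == b)             /* step 3: concurrent write with   */
                result = 1;         /*         "or" resolution         */
        }
        return result;              /* 1: value found, 0: not found    */
    }

    int main(void) {
        int array[8] = {3, 2, 5, 7, 2, 5, 1, 6};
        printf("%d\n", crcw_search(array, 8, 2));   /* prints 1 */
        return 0;
    }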
Search CREW p = m
• step 3: each processor computes 1 if A = B, else 0 (exclusive writes), giving 0 1 0 0 1 0 0 0
• The same processors can be reused in the next step!
• step 4.1: read A; step 4.2: read B; step 4.3: compute A or B
  • each step “or”s pairs of partial results, halving the number of active processors (P1–P4, then P1–P2, then P1), so log m steps are needed (see the sketch below)
• Complexity
  • We need log m steps to “collect” the result
  • Operations done in constant time
  • O(log m) complexity
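The “collect” phase can be sketched the same way (a sequential C simulation, assuming m is a power of two; naming is ours): pairs of 0/1 flags are “or”-ed together, halving the number of active processors each step.

    /* CREW collect: or-reduce m match flags in log2(m) steps */
    int crew_collect(int *flag, int m) {
        for (int half = m / 2; half >= 1; half /= 2)
            for (int pn = 0; pn < half; pn++)           /* active procs    */
                flag[pn] = flag[pn] | flag[pn + half];  /* exclusive write */
        return flag[0];             /* 1 iff the value was found */
    }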
Search EREW p = m
• The value 2 must first be distributed without concurrent reads: P1 holds it, then P1–P2 copy it to two more processors, then P1–P4 to four more, until P1–P8 all hold it (sketched below)
• It takes log m steps to distribute the value. More complex? NO, the algorithm is still in O(log m), only the constant differs
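A matching sketch of the EREW distribution phase (same assumptions and naming as above): the key is copied by recursive doubling, so the number of processors holding it doubles each step.

    /* EREW distribute: after step k, 2^k processors hold the key,
       so log2(m) steps reach all m processors (m a power of two). */
    void erew_distribute(int *val, int m, int key) {
        val[0] = key;
        for (int have = 1; have < m; have *= 2)
            for (int pn = 0; pn < have; pn++)   /* each holder copies the */
                val[pn + have] = val[pn];       /* key to one new cell    */
    }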
PRAM a Theoretical Model
• CRCW
  • Very elegant
  • Not of much practical use (too hard to implement)
• CREW
  • This model can be used to develop algorithms for parallel computers, e.g. our search example
  • p = 1 (a single processor): checking all elements gives O(m)
  • p = m (m processors): complexity O(log m), not O(1)
• From our example we conclude that even in theory we do not get an m-times “speedup” using m processors
THAT IS ONE BIG PROBLEM WITH PARALLEL COMPUTERS
Parallelism so far
• By pipelining, several instructions (at different stages) are executed simultaneously
  • Pipeline depth limited by hazards
• SuperScalar designs provide parallel execution units
  • Limited by instruction and machine level parallelism
• VLIW might improve over hardware instruction issuing
• All limited by the instruction fetch mechanism
  • Called the FLYNN BOTTLENECK
  • Only a very limited number of instructions can be fetched each cycle
  • That makes vector operations ineffective
Vector Processors
• Taking pipelining to its limits for vector operations
  • Sometimes referred to as a SuperPipeline
• The same operation is performed on a vector of data
  • No data dependencies within the vector data
  • Ex: add two vectors (see the sketch below)
• Solves the FLYNN BOTTLENECK problem
  • A loop over a vector can be issued by a single instruction
• Proven to be very effective for scientific calculations
  • CRAY-1, CRAY-2, CRAY-XMP, CRAY-YMP
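As a sketch (plain C, not actual CRAY code), the loop below is what one vector add replaces; with no dependences between iterations, the elements can stream through the pipeline back to back.

    /* On a vector processor this whole loop is issued as a handful of
       vector instructions (load a, load b, vector add, store c)
       instead of 64 separate trips through instruction fetch. */
    void vector_add(const double *a, const double *b, double *c) {
        for (int i = 0; i < 64; i++)    /* 64: a typical vector length */
            c[i] = a[i] + b[i];
    }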
Vector Processor (CRAY-1 like)
• MAIN MEMORY feeds a vector load/store unit
• SuperPipelined arithmetical units: FP add/subtract, FP multiply, FP divide, Integer, Logical
• Vector registers supply the arithmetical units
• Scalar registers (like the MIPS register file)
Vector Operations
• Fully Pipelined
  • CPI = 1, we produce one result each cycle once the pipe is full
• Pipeline Latency
  • Startup cost = pipeline depth
  • Vector Add: 6 cycles
  • Vector Multiply: 6 cycles
  • Vector Divide: 20 cycles
  • Vector Load: 12 cycles (depends on the memory hierarchy)
• Sustained rate
  • Time/element for a collection of related vector operations
Vector Processor Design
• Vector length control
  • VLR register (Maximum Vector Length, MVL)
  • Strip mining in software (a vector longer than MVL causes a loop, see the sketch below)
• Stride
  • How to lay out vectors and matrices in memory such that memory banks can be accessed without collision
• Vector Chaining
  • Forwarding between vector registers (minimizes latency)
• Vector Mask Register (Boolean valued)
  • Conditional writeback (if 0, no writeback)
  • Sparse matrices and conditional execution
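A strip-mining sketch in C (MVL = 64 is an assumed value): a vector of arbitrary length n is processed in chunks of at most MVL elements, each chunk standing for one vector instruction sequence with VLR set to the chunk length.

    #define MVL 64   /* assumed Maximum Vector Length */

    void strip_mined_add(double *c, const double *a,
                         const double *b, int n) {
        for (int low = 0; low < n; low += MVL) {
            int len = (n - low < MVL) ? (n - low) : MVL; /* VLR = len */
            for (int i = 0; i < len; i++)    /* one vector operation  */
                c[low + i] = a[low + i] + b[low + i];
        }
    }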
Programming
• By use of language constructs the compiler is able to utilize the vector functions
• FORTRAN is widely used for scientific calculations
  • Built-in matrix and vector functions/commands
• LINPACK
  • A library of optimized linear algebra functions
  • Often used as a benchmark (but does it tell the whole truth?)
• Some more (implicit) vectorization is possible with advanced compilers, as in the loop below
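For instance DAXPY (y = a*x + y), the kernel at the heart of LINPACK, shown here as a plain C sketch: there are no loop-carried dependences, so a vectorizing compiler can map the loop directly onto vector instructions.

    void daxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)     /* independent iterations */
            y[i] = a * x[i] + y[i];
    }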
Flynn Classification
• SISD (Single Instruction, Single Data)
  • The MIPS, and even the Vector Processor
• SIMD (Single Instruction, Multiple Data)
  • Each instruction activates several execution units in parallel
• MISD (Multiple Instruction, Single Data)
  • The VLIW architecture might be considered, but… MISD is a seldom used classification
• MIMD (Multiple Instruction, Multiple Data)
  • Multiprocessor architectures
  • Multicomputers (communicating over a LAN), sometimes treated as a separate class of architectures
Communication Networks
• Bus
  • Total Bandwidth = Link Bandwidth
  • Bisection Bandwidth = Link Bandwidth
• Ring
  • Total Bandwidth = P * Link Bandwidth
  • Bisection Bandwidth = 2 * Link Bandwidth
• Fully Connected
  • Total Bandwidth = (P * (P - 1) / 2) * Link Bandwidth
  • Bisection Bandwidth = (P / 2)^2 * Link Bandwidth
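A worked check of the formulas (P = 8 processors and unit link bandwidth are assumed numbers for illustration):

    #include <stdio.h>

    int main(void) {
        int P = 8;   /* assumed; link bandwidth = 1 */
        printf("bus:             total=%2d  bisection=%2d\n", 1, 1);
        printf("ring:            total=%2d  bisection=%2d\n", P, 2);
        printf("fully connected: total=%2d  bisection=%2d\n",
               P * (P - 1) / 2, (P / 2) * (P / 2));
        return 0;    /* prints 1/1, 8/2, 28/16 */
    }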
MultiStage Networks
• Crossbar Switch
  • Any one-to-one set of connections can be made at the same time, e.g. P1 to P2, P2 to P4, P3 to P1
• Omega Network
  • log2 P stages of 2×2 switches, so far fewer switches than a crossbar
  • Less flexible: P1 to P6 can be routed, but then P2 to P8 is not possible at the same time (the routes conflict inside the network), see the simulation below
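A small simulation of one common Omega routing scheme, destination-tag routing (our sketch, with processors indexed 0 to 7 rather than P1 to P8; not necessarily the exact network drawn on the slide): before each stage the wire address is perfect-shuffled, then the switch output is chosen by the next bit of the destination. Two messages block each other when they need the same wire; in this model, for example, the routes 0 to 0 and 4 to 1 collide after the first stage.

    #include <stdio.h>

    #define N 8              /* processors            */
    #define STAGES 3         /* log2(N) switch stages */

    /* Records the wire used after each stage in trace[1..STAGES]. */
    void route(int src, int dst, int trace[STAGES + 1]) {
        int pos = src;
        trace[0] = pos;
        for (int s = 0; s < STAGES; s++) {
            /* perfect shuffle: rotate the 3-bit address left by one  */
            pos = ((pos << 1) | (pos >> (STAGES - 1))) & (N - 1);
            /* 2x2 switch: low bit := next destination bit (MSB first) */
            pos = (pos & ~1) | ((dst >> (STAGES - 1 - s)) & 1);
            trace[s + 1] = pos;
        }
    }

    int main(void) {
        int a[STAGES + 1], b[STAGES + 1];
        route(0, 0, a);
        route(4, 1, b);
        for (int s = 1; s <= STAGES; s++)
            if (a[s] == b[s])
                printf("conflict on wire %d after stage %d\n", a[s], s);
        return 0;
    }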
Connection Machine CM-2 (SIMD)
• 16 fully connected 1-bit CPUs on each chip
  • Each CPU has 3 one-bit registers and 64 Kbit of memory
• Four sections of 1024 chips each: every section holds 16k 1-bit CPUs and 512 FPAs (floating-point accelerators)
• CM-2 uses a 12-cube for communication between the chips (the slide illustrates a 3-cube)
• A front end (SISD) feeds instructions through a sequencer
• Data Vault (Disk Array) for mass storage
SIMD Programming, Parallel sum

    sum = 0;
    for (i = 0; i < 65536; i = i + 1)     /* Loop over 65k elements           */
        sum = sum + A[Pn,i];              /* Pn is the processor number       */

    limit = 8192; half = limit;           /* Collect sum from 8192 processors */
    repeat
        half = half / 2;                  /* Split into senders/receivers     */
        if (Pn >= half && Pn < limit) send(Pn - half, sum);
        if (Pn < half) sum = sum + receive();
        limit = half;
    until (half == 1)                     /* final sum ends up in P0          */

For example, with four processors: P2 and P3 send their sums to P0 and P1, which add them in; then P1 sends to P0, which holds the final sum.
SIMD vs MIMD
• SIMD
  • Single Instruction (one PC)
  • All processors perform the same work (synchronized)
  • Conditional execution (case/if etc.): each processor holds an enable bit
• MIMD
  • Each processor has its own PC
  • Possible to run different programs, BUT
  • All may run the same program (SPMD, Single Program Multiple Data)
    • Use MIMD style programming for conditional execution
    • Use SIMD style programming for synchronized actions
Memory Architectures for MIMD
• Centralized
  • A single bus is used for all main memory
  • Uniform memory access (after passing the local cache)
• Distributed
  • The sought address might be hosted by another processor
  • Non-uniform memory access (dynamic “find” time)
  • The extreme: a cache-only memory
• Shared
  • All processors share the same address space
  • Memory can be used for communication
• Private
  • Each processor has a unique address space
  • Communication must be done by “message passing”
Shared Bus MIMD
• Usually 2–32 processors, each with a snooping cache (snoop tag), sharing one bus to MEMORY and I/O
• Cache Coherency Protocol
  • Write Invalidate
    • The first write to address A causes all other cached copies of A to be invalidated
  • Write Update
    • On a write to address A, all cached copies of A are updated (high bus activity)
  • On a cache read miss when using WB caches, either:
    • The cache holding the valid data writes it to memory, or
    • The cache holding the valid data writes it directly to the cache requesting the data
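A toy write-invalidate simulation in C (a deliberate simplification with write-through memory so the sketch stays short; a real snooping WB protocol has more states, as the last bullets hint):

    #include <stdio.h>

    #define NPROC 3
    int valid[NPROC];    /* does processor i hold a valid copy of A? */
    int cached[NPROC];   /* processor i's cached value of A          */
    int mem_A = 0;       /* main-memory copy of A                    */

    void cpu_write(int p, int v) {
        cached[p] = v; valid[p] = 1;
        mem_A = v;                        /* write-through (simplified) */
        for (int i = 0; i < NPROC; i++)   /* bus snoop: invalidate      */
            if (i != p) valid[i] = 0;     /* every other cached copy    */
    }

    int cpu_read(int p) {
        if (!valid[p]) {                  /* read miss: refetch copy    */
            cached[p] = mem_A;
            valid[p] = 1;
        }
        return cached[p];
    }

    int main(void) {
        cpu_write(0, 42);
        printf("%d\n", cpu_read(1));      /* 42: P1 missed, refetched   */
        cpu_write(1, 7);                  /* invalidates P0's copy      */
        printf("%d\n", cpu_read(0));      /* 7: stale copy was dropped  */
        return 0;
    }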
Synchronization
• When using shared data we must ensure that only one processor at a time can access the data when updating
• We need an atomic operation for TEST&SET

Both Processor 1 and Processor 2 run:

    loop: TEST&SET A.lock
          beq A.go loop
          update A
          clear A.lock

Processor 1 gets the lock (A.go), updates the shared data, and finally clears the lock (A.lock). Processor 2 spin-waits until the lock is released, then updates the shared data and releases the lock.
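The same spin-lock pattern in portable C11 (a sketch: atomic_flag_test_and_set is the standard library's atomic test-and-set, playing the role of TEST&SET above):

    #include <stdatomic.h>

    atomic_flag A_lock = ATOMIC_FLAG_INIT;

    void lock(void) {
        /* atomically read the old value and set the flag;
           spin while another processor already holds the lock */
        while (atomic_flag_test_and_set(&A_lock))
            ;
    }

    void unlock(void) {
        atomic_flag_clear(&A_lock);   /* clear A.lock */
    }

    /* usage: lock(); ... update shared data A ...; unlock(); */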