
EECS 570


Presentation Transcript


  1. EECS 570 • Notes on Chapter 1 – Introduction • What is Parallel Architecture? • Evolution and convergence of Parallel Architectures • Fundamental Design Issues • Acknowledgements • Slides are derived from work by Steve Reinhardt (Michigan), Mark Hill (Wisconsin), Sarita Adve (Illinois), Babak Falsafi (CMU), Alvy Lebeck (Duke), and J. P. Singh (Princeton). Many Thanks! EECS 570: Fall 2003 -- rev3

  2. What is Parallel Architecture? • parallel – OED • parallel, a. and sb. • 2. d. Computers. Involving the concurrent or simultaneous performance of certain operations; functioning in this way. • 1948 Math. Tables & Other Aids to Computation III. 149 The use of plugboard facilities and punched cards permits parallel operation (as distinguished from sequence operation), with further gain in efficiency. • 1963 W. H. Ware Digital Computer Technol. & Design II. xi. 3 Parallel arithmetic tends to be faster than serial arithmetic because it performs operations in all columns at once, rather than in one column at a time. • 1974 P. H. Enslow Multiprocessors & Parallel Processing i. 1 This book focuses on..the integration of multiple functional units into a multiprocessing or parallel processing system. EECS 570: Fall 2003 -- rev3

  3. Spectrum of Parallelism [Figure: spectrum running from serial, pipelining, superscalar, VLIW, multithreading, multiprocessing, to distributed systems, with EECS 370/470/570/591 marking the regions each course covers] • Key differences • granularity of operations • frequency/overhead of communication • degree of parallelism • source of parallelism • data vs. control • parts of larger task vs. independent tasks • source of decomposition (hardware, compiler, programmer, OS …) EECS 570: Fall 2003 -- rev3

  4. Course Focus: Multithreading & Multiprocessing • High end: many applications where even a fast CPU isn’t enough • Scientific computing: aerodynamics, crash simulation, drug design, weather prediction, materials, … • General-purpose computing: graphics, databases, web servers, data mining, financial modeling, … • Low end: • cheap microprocessors make small MPs affordable • Future: • Chip multiprocessors (almost here) – CMPs • Multithreading supported on the Pentium 4 • see http://www.intel.com/homepage/land/hyperthreading_more.htm EECS 570: Fall 2003 -- rev3

  5. Motivation • N processors in a computer can provide: • Higher Throughput via many jobs in parallel • individual jobs no faster • Cost-Effectiveness may improve: users share a central resource • Lower Latency from shrink-wrapped software (e.g., Photoshop™) • Parallelizing your application (but this is hard) • From reduced queuing delays • Need something faster than today’s microprocessor? • Wait for tomorrow’s microprocessor • Use many microprocessors in parallel EECS 570: Fall 2003 -- rev3

  6. Historical Perspective • End of uniprocessor performance has been frequently predicted due to fundamental limits • spurred work in parallel processing – cf. Thornton’s arguments for ILP in the CDC 6600 http://www.cs.nmsu.edu/~pfeiffer/classes/473/notes/cdc.html • No common parallel programming model • unlike the von Neumann Model • many models: data parallel, shared memory, message passing, dataflow, systolic, graph reduction, declarative logic • no pre-existing software to target • no common building blocks – high-performance micros are changing this • result: lots of one-of-a-kind architectures with no software base • architecture defines the programming model EECS 570: Fall 2003 -- rev3

  7. What’s different today? Key: a microprocessor is now the fastest uniprocessor you can build • insurmountable handicap to build on anything else • Amdahl’s law • favorable performance per $ • general-purpose processors (GPPs) enjoy volume production • small-scale bus-based shared memory is well understood • P6 (Pentium Pro/II/III) supports 4-way “glueless” MP • supported by most common OSes (e.g., NT, Solaris, Linux) EECS 570: Fall 2003 -- rev3

  8. Technology Trends The natural building block for multiprocessors (the commodity microprocessor) is now also about the fastest processor available. EECS 570: Fall 2003 -- rev3

  9. What's different today? (cont'd) • Meanwhile, programming models have converged to a few: • shared memory (better: shared address space) • message passing • data parallel (compiler maps to one of above) • data flow (more as concept than model) • Result: parallel system is microprocessors + memory + interconnection network • Still many design issues to consider EECS 570: Fall 2003 -- rev3

  10. Parallel Architecture Today • Key: abstractions & interfaces for communication and cooperation • Communication Architecture • equivalent to Instruction Set Architecture for uniprocessors • Must consider • Usability (programmability) & performance • Feasibility/complexity of implementation (hw or sw) • Compilers, libraries and OS are important bridges today EECS 570: Fall 2003 -- rev3

  11. Modern Layered Framework EECS 570: Fall 2003 -- rev3

  12. Survey of Programming Models • Shared Address Space • Message Passing • Data Parallel • Others: • Dataflow • Systolic Arrays (see text) Examine programming model, motivation, intended applications, and contributions to convergence EECS 570: Fall 2003 -- rev3

  13. Simple Example
    int i;
    double a, x[N], y[N], z[N], sum;
    /* input a, x[], y[] */
    sum = 0;
    for (i = 0; i < N; ++i) {
        z[i] = a * x[i] + y[i];
        sum += z[i];
    }
  EECS 570: Fall 2003 -- rev3
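
For reference, here is a self-contained, runnable version of this serial baseline; the value of N and the input initialization are placeholders chosen only so the fragment compiles and runs.

    #include <stdio.h>

    #define N 8                         /* problem size: placeholder value */

    int main(void) {
        double a = 2.0, x[N], y[N], z[N], sum = 0.0;
        int i;

        /* stand-in for "input a, x[], y[]" */
        for (i = 0; i < N; ++i) { x[i] = i; y[i] = 1.0; }

        for (i = 0; i < N; ++i) {
            z[i] = a * x[i] + y[i];     /* element-wise a*x + y */
            sum += z[i];                /* running reduction into sum */
        }
        printf("sum = %f\n", sum);
        return 0;
    }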

  14. Dataflow Graph [Figure: dataflow graph of the simple example: the products a * x[i] fire in parallel, each result is added to y[i], and the resulting z[i] values feed a series of + nodes that accumulate sum] 2 + N-1 cycles to execute on N processors (what assumptions?) EECS 570: Fall 2003 -- rev3

  15. Shared Address Space Architectures Any processor can directly reference any memory location • Communication occurs implicitly as a result of loads and stores • Need additional synchronization operations Convenient: • Location transparency • Similar programming model to time-sharing on uniprocessors • Except processes run on different processors • Good throughput on multiprogrammed workloads Within one process (“lightweight” threads): all memory shared Among processes: portions of address spaces shared (mmap, shmat) • In either case, variables may be logically global or per-thread • Popularly known as shared memory machines or model • Ambiguous: memory may be physically distributed among processors EECS 570: Fall 2003 -- rev3

  16. Small-scale Implementation Natural extension of uniprocessor: already have processor, memory modules and I/O controllers on an interconnect of some sort • typically a bus • may be a crossbar (mainframes) • occasionally a multistage network (vector machines, ??) Just add processors! [Figure: processors, memory modules, and I/O controllers (with I/O devices) all attached to a shared interconnect] EECS 570: Fall 2003 -- rev3

  17. Simple Example: SAS version
    /* per-thread */
    int i, my_start, my_end, me;
    /* global */
    double a, x[N], y[N], z[N], sum;

    /* my_start, my_end based on N, # nodes */
    for (i = my_start; i < my_end; ++i)
        z[i] = a * x[i] + y[i];
    BARRIER;
    if (me == 0) {
        sum = 0;
        for (i = 0; i < N; ++i)
            sum += z[i];
    }
  EECS 570: Fall 2003 -- rev3
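
A minimal sketch of how this SAS pseudocode might look as real code, using POSIX threads in C; the thread count P, the value of N, the even block partitioning, and the use of a pthread barrier in place of BARRIER are all assumptions for illustration, not part of the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    #define P 4                            /* number of threads (assumed) */

    /* globals: shared by all threads, as in the SAS model */
    double a = 2.0, x[N], y[N], z[N], sum;
    pthread_barrier_t bar;

    void *worker(void *arg) {
        int me = (int)(long)arg;
        int my_start = me * (N / P), my_end = my_start + (N / P);
        int i;

        for (i = my_start; i < my_end; ++i)
            z[i] = a * x[i] + y[i];        /* each thread fills its block of z */

        pthread_barrier_wait(&bar);        /* BARRIER: all of z[] is written */

        if (me == 0) {                     /* thread 0 performs the reduction */
            sum = 0;
            for (i = 0; i < N; ++i)
                sum += z[i];
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[P];
        int i;
        for (i = 0; i < N; ++i) { x[i] = i; y[i] = 1.0; }   /* placeholder input */
        pthread_barrier_init(&bar, NULL, P);
        for (i = 0; i < P; ++i)
            pthread_create(&t[i], NULL, worker, (void *)(long)i);
        for (i = 0; i < P; ++i)
            pthread_join(t[i], NULL);
        printf("sum = %f\n", sum);
        return 0;
    }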

  18. Message Passing Architectures Complete computer as building block • ld/st access only private address space (local memory) • Communication via explicit I/O operations (send/receive) • Specify local memory buffers • Synchronization implicit in msgs Programming interface often more removed from basic hardware • Library and/or OS intervention Biggest machines are of this sort • IBM SP/2 • DoE ASCI program • Clusters (Beowulf etc.) EECS 570: Fall 2003 -- rev3

  19. Simple Example: MP version
    int i, me;
    double a, x[N/P], y[N/P], z[N/P], sum, tmp;

    sum = 0;
    for (i = 0; i < N/P; ++i) {
        z[i] = a * x[i] + y[i];
        sum += z[i];
    }
    if (me != 0)
        send(sum, 0);
    else
        for (i = 1; i < P; ++i) {
            recv(tmp, i);
            sum += tmp;
        }
  EECS 570: Fall 2003 -- rev3
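
The same program written against MPI, as a hedged sketch of what the slide's send/recv pseudocode maps to in practice; the choice of MPI, the value of N, and the input values are my assumptions.

    #include <mpi.h>
    #include <stdio.h>

    #define N 8                          /* global problem size (assumed) */

    int main(int argc, char **argv) {
        int i, me, P;
        double a = 2.0, sum = 0.0, tmp;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &P);

        int n_local = N / P;             /* each node owns N/P elements (assumes P divides N) */
        double x[n_local], y[n_local], z[n_local];

        for (i = 0; i < n_local; ++i) { x[i] = me * n_local + i; y[i] = 1.0; }

        for (i = 0; i < n_local; ++i) {  /* purely local computation */
            z[i] = a * x[i] + y[i];
            sum += z[i];
        }

        if (me != 0) {                   /* explicit send of the partial sum */
            MPI_Send(&sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        } else {                         /* node 0 receives and accumulates */
            for (i = 1; i < P; ++i) {
                MPI_Recv(&tmp, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                sum += tmp;
            }
            printf("sum = %f\n", sum);
        }
        MPI_Finalize();
        return 0;
    }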

  20. Convergence: Scaling Up SAS • Problem is interconnect: cost (crossbar) or bandwidth (bus) • Distributed memory or non-uniform memory access (NUMA) • “Communication assist” turns non-local accesses into simple message transactions (e.g., read-request, read-response) • issues: cache coherence, remote memory latency • MP HW specialized for read/write requests [Figure: two organizations. “Dance hall”: processors with caches on one side of the interconnect and all memory modules on the other. “Distributed memory”: each processor/cache pair has its own local memory, and the interconnect joins the nodes.] EECS 570: Fall 2003 -- rev3

  21. Separation of Architecture from Model At the lowest level, a shared-memory (SM) machine sends messages • HW is specialized to expedite read/write messages What programming model/abstraction is supported at user level? Can I have a shared-memory abstraction on message-passing HW? Can I have a message-passing abstraction on shared-memory HW? Recent research machines integrate both • Alewife, Tempest/Typhoon, FLASH EECS 570: Fall 2003 -- rev3

  22. Data Parallel Systems Programming model • Operations performed in parallel on each element of a data structure • Logically single thread of control, performs sequential or parallel steps • Synchronization implicit in sequencing • Conceptually, a processor associated with each data element Architectural model • Array of many simple, cheap processing elements (PEs, really just datapaths), each with no instruction memory and little data memory • Attached to a control processor that issues instructions • Specialized and general communication, cheap global synch. [Figure: grid of PEs driven by the control processor] EECS 570: Fall 2003 -- rev3

  23. Simple Example: DP version
    double a, x[N], y[N], z[N], sum;
    z = a * x + y;
    sum = reduce(+, z);
  Language supports array assignment, global operations Other examples: document searching, image processing, ... Some recent (within last decade+) machines: • Thinking Machines CM-1, CM-2 (and CM-5) • Maspar MP-1 and MP-2 EECS 570: Fall 2003 -- rev3
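
C has no built-in array assignment or reduce, but as a rough sketch of the convergence point made on the next slide (a high-level data-parallel program compiled down to a shared-memory machine), the same computation could be written with OpenMP pragmas; OpenMP, N, and the inputs here are my assumptions, not the slide's notation.

    #include <stdio.h>

    #define N 8

    int main(void) {
        double a = 2.0, x[N], y[N], z[N], sum = 0.0;
        for (int i = 0; i < N; ++i) { x[i] = i; y[i] = 1.0; }  /* placeholder input */

        /* "z = a * x + y": one logical array operation, elements in parallel */
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            z[i] = a * x[i] + y[i];

        /* "sum = reduce(+, z)": global reduction */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; ++i)
            sum += z[i];

        printf("sum = %f\n", sum);
        return 0;
    }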

  24. DP Convergence with SAS/MP Popular when cost savings of a centralized sequencer were high • ’60s, when a CPU was a cabinet • Replaced by vectors in the mid-70s • More flexible w.r.t. memory layout and easier to manage • Revived in mid-80s when a datapath just fit on a chip (w/o control) • No longer true with modern microprocessors Other reasons for demise • DP applications are simple, regular • relatively easy targets for compilers • easy to partition across a relatively small # of microprocessors • MIMD machines effective for these apps plus many others Contributions to convergence • utility of fast global synchronization, reductions, etc. • high-level model that can compile to either platform EECS 570: Fall 2003 -- rev3

  25. Dataflow Architectures Represent computation as a graph of essential dependences • Logical processor at each node, activated by availability of operands • Messages (tokens) carrying the tag of the next instruction are sent to the next processor • Tag compared with others in the matching store; a match fires execution EECS 570: Fall 2003 -- rev3

  26. Basic Dataflow Architecture EECS 570: Fall 2003 -- rev3

  27. DF Evolution and Convergence Key characteristics • high parallelism: no artificial limitations • dynamic scheduling: fully exploit multiple execution units Problems • Operations have locality; nice to group them to reduce communication • No conventional notion of memory – how do you declare an array? • Complexity of matching store: large associative search! • Too much parallelism! Management overhead > benefit Converged to use conventional processors and memory • Group related ops to exploit registers, cache – fine-grain threading • Results communicated between threads via messages Lasting contributions: • Stresses tightly integrated communication & execution (e.g. create thread to handle message) • Remains useful concept for ILP hardware, compilers EECS 570: Fall 2003 -- rev3

  28. Programming Model Design Issues • Naming: How is communicated data and/or the partner node referenced? • Operations: What operations are allowed on named data? • Ordering: How can producers and consumers of data coordinate their activities? • Performance • Latency: How long does it take to communicate in a protected fashion? • Bandwidth: How much data can be communicated per second? How many operations per second? EECS 570: Fall 2003 -- rev3

  29. Issue: Naming Single Global Linear-Address-Space (shared memory) Single Global Segmented-Name-Space (global objects/data parallel) • uniform address space • uniform accessibility (load/store) Multiple Local Address/Name Spaces (message passing) • two-level address space (node + memory address) • non-uniform accessibility (use messages if node != me) Naming strategy affects • Programmer/Software • Performance • Design Complexity EECS 570: Fall 2003 -- rev3

  30. Issue: Operations • SAS • ld/st, arithmetic on any item (in source language) • additional ops for synchronization (locks, etc.), usually on memory locations • Message passing • ld/st, arithmetic etc. only on local items • send/recv on (local memory range, remote node ID) tuple • Data parallel • arithmetic etc. • global operations (sum, max, min, etc.) EECS 570: Fall 2003 -- rev3

  31. Ordering • Uniprocessor • program order of instructions (note: specifies effect not reality) • SAS • uniprocessor within thread • implicit memory ordering among threads very subtle • need explicit synchronization operations • Message passing • uniprocessor within node; can't recv before send • Data parallel • program order of operations (just like uni) • all parallelism is within individual operations • implicit global barrier after every step EECS 570: Fall 2003 -- rev3
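
To make the "implicit memory ordering among threads very subtle" point concrete, here is a small producer/consumer hand-off written with C11 atomics; the flag protocol is an invented example. Without the release/acquire pair on flag, a weakly ordered SAS machine may let the consumer see flag == 1 and still read a stale data value.

    #include <stdatomic.h>

    int data;                   /* ordinary shared location */
    atomic_int flag;            /* explicit synchronization variable */

    void producer(void) {
        data = 42;                                      /* write payload first   */
        atomic_store_explicit(&flag, 1,
                              memory_order_release);    /* then publish the flag */
    }

    int consumer(void) {
        while (atomic_load_explicit(&flag,
                                    memory_order_acquire) == 0)
            ;                                           /* spin until published  */
        return data;                                    /* guaranteed to read 42 */
    }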

  32. Issue: Order/Synchronization Coordination mainly takes three forms: • mutual exclusion (e.g., spin-locks) • event notification • point-to-point (e.g., producer-consumer) • global (e.g., end-of-phase indication, to all or a subset of processes) • global operations (e.g., sum) Issues: • synchronization name space (entire address space or a portion) • granularity (per byte, per word, ... => overhead) • low latency, low serialization (hot spots) • variety of approaches • test&set, compare&swap, load-locked/store-conditional • full/empty bits and traps • queue-based locks, fetch&op with combining EECS 570: Fall 2003 -- rev3
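
As one concrete instance of the approaches listed above, a test&set-style spin lock can be sketched with C11's atomic exchange. Whether the hardware realizes this via test&set, compare&swap, or load-locked/store-conditional varies; this is only an illustrative mapping, and a production lock would add backoff or queueing to avoid the hot spots mentioned on the slide.

    #include <stdatomic.h>

    typedef struct { atomic_int held; } spinlock_t;     /* 0 = free, 1 = held */

    void lock(spinlock_t *l) {
        /* atomic_exchange acts as test&set: set the flag and return its old
           value; keep spinning while the old value shows the lock was held */
        while (atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 1)
            ;                                            /* busy-wait */
    }

    void unlock(spinlock_t *l) {
        atomic_store_explicit(&l->held, 0, memory_order_release);
    }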

  33. Communication Performance Performance characteristics determine usage of operations at a layer • Programmer, compilers must choose strategies leading to performance Fundamentally, three characteristics: • Latency: time from send to receive • Bandwidth: max transmission rate (bytes/sec) • Cost: impact on execution time of program If processor does one thing at a time: bandwidth ∝ 1/latency • But actually more complex in modern systems EECS 570: Fall 2003 -- rev3

  34. Communication Cost Model • Communication time for one n-byte message: Comm Time = latency + n/bandwidth • Latency has two parts: • overhead is time the CPU is busy (protection checks, formatting header, copying data, etc) • rest of latency can be lumped as network delay • Bandwidth is determined by communication bottleneck • occupancy of a component is amount of time that component spends dedicated to one message • in steady state, can't do better than 1/(max occupancy) EECS 570: Fall 2003 -- rev3

  35. Cost Model (cont'd) Overall execution-time impact depends on: • amount of communication • amount of comm. time hidden by other useful work (overlap) comm cost = frequency * (comm time - overlap) Note that: • overlap is limited by overhead • overlap is another form of parallelism EECS 570: Fall 2003 -- rev3
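
A small sketch that plugs numbers into the two formulas above; every parameter value here is made up purely for illustration.

    #include <stdio.h>

    int main(void) {
        /* assumed machine parameters (illustrative only) */
        double overhead  = 1.0e-6;    /* s the CPU is busy per message      */
        double net_delay = 5.0e-6;    /* s: remaining network delay         */
        double bandwidth = 100.0e6;   /* bytes/s at the bottleneck          */
        double n         = 1024.0;    /* message size in bytes              */

        double latency   = overhead + net_delay;
        double comm_time = latency + n / bandwidth;            /* slide 34 */

        /* assumed program behavior (illustrative only) */
        double frequency = 1.0e4;     /* messages sent by the program       */
        double overlap   = 4.0e-6;    /* s of comm hidden behind other work */
        double comm_cost = frequency * (comm_time - overlap);  /* slide 35 */

        printf("comm time = %.2f us per message\n", comm_time * 1e6);
        printf("comm cost = %.4f s added to execution time\n", comm_cost);
        return 0;
    }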

  36. Replication • Very important technique for reducing communication frequency • Depends on naming model • Uniprocessor: caches do it automatically & transparently • SAS: uniform naming allows transparent replication • Caches again • OS can do it at page level (useful with distributed memory) • Many copies for the same name: coherence problem • Message Passing: • Send/receive replicates, giving data a new name – not transparent • Software at some level must manage it (programmer, library, compiler) EECS 570: Fall 2003 -- rev3

  37. Summary of Design Issues Functional and performance issues apply at all layers Functional: Naming, operations and ordering Performance: Organization, latency, bandwidth, overhead, occupancy Replication and communication are deeply related • Management depends on naming model Goal of architects: design against frequency and type of operations that occur at communication abstraction, constrained by tradeoffs from above or below • Hardware/software tradeoffs EECS 570: Fall 2003 -- rev3
