
Platform Design

Platform Design. Multi-Processor Systems-on-Chip MPSoC. TU/e 5kk70 Henk Corporaal Bart Mesman. Overview. What is a platform, and why platform based design? Why parallel platforms? A first classification of parallel systems Design choices for parallel systems Shared memory systems


Presentation Transcript


  1. Platform Design: Multi-Processor Systems-on-Chip (MPSoC). TU/e 5kk70. Henk Corporaal, Bart Mesman

  2. Overview • What is a platform, and why platform-based design? • Why parallel platforms? • A first classification of parallel systems • Design choices for parallel systems • Shared memory systems • Memory Coherency, Consistency, Synchronization, Mutual exclusion • Message passing systems • Further decisions Platform Design H. Corporaal and B. Mesman

  3. Design & Product requirements? • Short Time-to-Market • Reuse / Standards • Short design time • Flexible solution • Reduces design time • Extends product lifetime; remote inspect and debug, … • Scalability • High performance and Low power • Memory bottleneck, Wiring bottleneck • Low cost • High quality, reliability, dependability • RTOS and libs • Good programming environment

  4. Solution ? • Platforms • Programmable • One or more processor cores • Reconfigurable • Scalable and flexible • Memory hierarchy • Exploit locality • Separate local and global wiring • HW and SW IP reuse • Standardization (on SW and HW-interfaces) • Raising design abstraction level • Reliable • Cheaper • Advanced Design Flow for Platforms

  5. What is a platform? Definition: A platform is a generic but domain-specific information processing (sub-)system. Generic means that it is flexible, containing programmable component(s). Platforms are meant to quickly realize your next system (in a certain domain). Single chip?

  6. Example Platform: Sanyo Camera

  7. Platform example: TI OMAP. [Figure: block diagram; recoverable labels: 192 Kbyte shared SRAM; 8 Kb data cache (2-way, 512 lines of 16 bytes); write buffer (17 elements); 16 Kb (2-way); 16 Kb (2-way); 8 Kb mem (2x 4K); 64 Kb dual port (8x 4K x 16b); 96 Kb single port (12x 4K x 16b); 32 Kb ROM; up to 192 Mbyte off-chip memory]

  8. Platform and platform design. [Figure: layered diagram; labels: Applications; SDT (system design technology); Platform; PDT (platform design technology); Enabling technologies; Design technology]

  9. Why parallel processing • Performance drive • Diminishing returns for exploiting ILP and OLP • Multiple processors fit easily on a chip • Cost effective (just connect existing processors or processor cores) • Low power: parallelism may allow lowering Vdd However: • Parallel programming is hard

  10. Low power through parallelism • Sequential Processor • Switching capacitance C • Frequency f • Voltage V • $P = f C V^2$ • Parallel Processor (two times the number of units) • Switching capacitance 2C • Frequency f/2 • Voltage V' < V • $P = (f/2)(2C)V'^2 = f C V'^2$
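
The power argument on this slide can be checked numerically. The sketch below is only an illustration: the frequency, capacitance and both voltages are hypothetical values chosen for the example, not figures from the presentation. It shows that doubling the units at half the clock leaves $fC$ unchanged, so the saving comes entirely from the factor $(V'/V)^2$.

```python
def dynamic_power(f, c, v):
    """Dynamic switching power P = f * C * V^2 (formula from the slide)."""
    return f * c * v ** 2


def parallel_power(f, c, v_low, units=2):
    """Power of `units` parallel copies, each clocked at f/units, at voltage v_low.

    The total switching capacitance scales with the number of units, as on the slide.
    """
    return dynamic_power(f / units, units * c, v_low)


# Hypothetical operating point (assumed for illustration only).
f = 1.0e9      # clock frequency in Hz
c = 1.0e-9     # switching capacitance in F
v = 1.2        # supply voltage of the sequential design in V
v_low = 0.9    # assumed lower supply voltage V' that still meets timing at f/2

p_seq = dynamic_power(f, c, v)
p_par = parallel_power(f, c, v_low)
ratio = p_par / p_seq  # equals (v_low / v) ** 2

print(f"sequential: {p_seq:.3f} W, parallel: {p_par:.3f} W, ratio: {ratio:.4f}")
```

How far V' can actually be lowered depends on the technology; the example only assumes that such a V' exists.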

  11. Power efficiency: compare 2 examples • Intel Pentium-4 (Northwood) in 0.13 micron technology • 3.0 GHz • 20 pipeline stages • Aggressive buffering to boost clock frequency • 13 nano Joule / instruction • Philips Trimedia “Lite” in 0.13 micron technology • 250 MHz • 8 pipeline stages • Relaxed buffering, focus on instruction parallelism • 0.2 nano Joule / instruction • Trimedia is doing 65x better than Pentium

  12. Parallel Architecture • Parallel Architecture extends traditional computer architecture with a communication network • abstractions (HW/SW interface) • organizational structure to realize abstraction efficiently [Figure: five processing nodes connected by a communication network]

  13. Platform characteristics • System level • Processor level • Communication network • Memory system • Tooling

  14. System level characteristics • Homogeneous vs. heterogeneous • Granularity of processing elements • Type of supported parallelism: TLP, DLP • Runtime mapping support?

  15. Homogeneous or Heterogeneous • Homogeneous: • replication effect • memory dominated anyway • solve realization issues once and for all • less flexible • Typically: • data level parallelism • shared memory • dynamic task mapping

  16. Example: Philips Wasabi • Homogeneous multiprocessor for media applications • Two-level communication hierarchy • Top: scalable message passing network plus tiles • Tile: shared memory plus processors, accelerators • Fully cache coherent to support data parallelism [Figure: tile diagram; labels: TM (seven times), memory, ARM, pixel simd, video scale, picture improve]

  17. Homogeneous or Heterogeneous • Heterogeneous • better fit to application domain • smaller increments • Typically: • task level parallelism • message passing • static task mapping

  18. Example: Viper2 • Heterogeneous • Platform based • >60 different cores • Task parallelism • Sync with interrupts • Streaming communication • Semi-static application graph • 50 M transistors • 120nm technology • Powerful, efficient [Figure: chip block diagram; labels: MBS, VMPG, TM3260, TDCS, VIP, MIPS PR4450, TM3260, QVCP5L, MSP, MDCS, QVCP2L]

  19. Homogeneous or Heterogeneous • Middle of the road approach • Flexible tiles • Fixed tile structure at top level

  20. Types of parallelism. [Figure; recoverable labels: DLP: homogeneous, SIMD / Vector • TLP: heterogeneous, Multi-threaded / MIMD • ILP: heterogeneous, VLIW / Superscalar / Dataflow arch. • Levels shown: Program/Thread level, Module level, Kernel level]

  21. Processor level characteristics Processor consists of • Instruction engine (Control Processor, Ifetch unit) • Processing element (PE): Register file, Function unit(s), L1 DMem • Single PE vs. multiple PEs (as in SIMD) • Single FU/PE vs. multiple FUs/PE (as in VLIW) • Granularity of PEs, FUs • Specialized vs. generic • Interruptible, pre-emption support • Multithreading support (fast context switches) • Clustering of PEs; Clustering of FUs • Type of inter PE and inter FU communication network • Others: MMU (virtual memory), …

  22. Generic or Specialized? Intrinsic computational efficiency

  23. General processor organization. [Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers, PC, instruction memory, register file, ALU, data memory, sign extend and multiplexers; annotations: PE: processing engine; Instruction fetch - Control; FU]

  24. (Linear) SIMD Architecture. [Figure: Control Processor with IMem driving PE1 … PEn; each PE has its own FU, RF and DMem] • To be added: • inter PE communication • communication from PEs to Control Processor • Input and Output

  25. Communication network • Bus (single all2all connection) vs. Crossbar vs. NoC with point-to-point connections • Topology, Router degree • Routing • path, path control, collision resolution, network support, deadlock handling, livelock handling • virtual layer support • flow control and buffering • error handling • Inter-chip network support • Guarantees • TDMA • GT vs. BE traffic • etc, etc.

  26. Comm. Network: Performance metrics • Network Bandwidth • Need high bandwidth in communication • How does it scale with number of nodes? • Communication Latency • Affects performance, since processor may have to wait • Affects ease of programming, since it requires more thought to overlap communication and computation • Latency Hiding • Global memory access can take hundreds of cycles • How can a mechanism help hide latency? • Examples: • overlap message send with computation, • prefetch data, • switch to other tasks

  27. How good is your network? Topology determines: • Degree = number of links from a node • Diameter = max number of links crossed between nodes • Average distance = number of links to random destination • Bisection = minimum number of links that separate the network into two halves • Bisection bandwidth = link bandwidth x bisection
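
The definitions of degree, diameter and average distance can be measured directly on a small network model. The following is a minimal sketch, assuming unit-cost bidirectional links and shortest-path routing; the builder functions `ring`, `mesh2d` and `hypercube` and the helper names are introduced here for illustration and do not come from the slides.

```python
from collections import deque
from itertools import product


def ring(n):
    """Ring of n nodes: each node links to its two neighbours."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def mesh2d(side):
    """side x side 2D mesh without wrap-around links."""
    adj = {}
    for x, y in product(range(side), repeat=2):
        nbrs = []
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < side and 0 <= ny < side:
                nbrs.append((nx, ny))
        adj[(x, y)] = nbrs
    return adj


def hypercube(dim):
    """Hypercube with 2**dim nodes: neighbours differ in exactly one address bit."""
    return {i: [i ^ (1 << b) for b in range(dim)] for i in range(1 << dim)}


def hop_counts(adj, src):
    """Breadth-first search: number of links on a shortest path from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def metrics(adj):
    """Return (max degree, diameter, average distance to a different node)."""
    degree = max(len(nbrs) for nbrs in adj.values())
    diameter = 0
    total = 0
    for src in adj:
        dist = hop_counts(adj, src)
        diameter = max(diameter, max(dist.values()))
        total += sum(dist.values())
    n = len(adj)
    return degree, diameter, total / (n * (n - 1))


for name, net in (("ring", ring(64)), ("2D mesh", mesh2d(8)), ("hypercube", hypercube(6))):
    deg, dia, avg = metrics(net)
    print(f"{name:10s} degree={deg} diameter={dia} avg distance={avg:.2f}")
```

All three example networks have 64 nodes, so the measured values can be compared with closed-form expressions for the same topologies. Bisection width is left out because computing it exactly for an arbitrary graph is a hard partitioning problem.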

  28. Metrics for common topologies (N = number of nodes, n = dimension)

      Type        Degree    Diameter          Ave Dist        Bisection
      1D mesh     2         N-1               N/3             1
      2D mesh     4         2(N^(1/2) - 1)    2N^(1/2) / 3    N^(1/2)
      3D mesh     6         3(N^(1/3) - 1)    3N^(1/3) / 3    N^(2/3)
      nD mesh     2n        n(N^(1/n) - 1)    nN^(1/n) / 3    N^((n-1)/n)
      Ring        2         N/2               N/4             2
      2D torus    4         N^(1/2)           N^(1/2) / 2     2N^(1/2)
      Hypercube   Log2 N    n = Log2 N        n/2             N/2
      2D Tree     3         2 Log2 N          ~2 Log2 N       1
      Crossbar    N-1       1                 1               N^2 / 2

  29. More topology metrics. Assume 64 nodes. [Table comparing Hypercube, Grid/Mesh and Torus; values not preserved in the transcript]

  30. Multi-stage network: Butterfly or Omega • All paths equal length • Unique path from any input to any output • Try to avoid conflicts! [Figure: 8 x 8 butterfly switch. Second panel, "How to make a bigger butterfly network?": two N/2 Butterfly blocks]
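
The "unique path" and "avoid conflicts" bullets can be illustrated with destination-tag routing. The sketch below models an Omega network (one of the two variants the slide names), assuming N = 2^k lines, k stages, a perfect shuffle in front of every stage, and 2x2 switches whose output is selected by one destination bit, most significant bit first. The function names are hypothetical, and a butterfly would use a different wiring pattern.

```python
def omega_route(src, dst, stages):
    """Line positions occupied by a packet from src to dst, before and after each stage.

    The network has 2**stages inputs. Each stage applies a perfect shuffle
    (rotate the line address left by one bit) and then a 2x2 switch that sets
    the lowest address bit to the next destination bit.
    """
    mask = (1 << stages) - 1
    pos = src
    path = [pos]
    for i in range(stages):
        # Perfect shuffle: rotate the stages-bit address left by one position.
        pos = ((pos << 1) | (pos >> (stages - 1))) & mask
        # Switch setting: upper output for a 0 bit, lower output for a 1 bit.
        bit = (dst >> (stages - 1 - i)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


def conflicts(route_a, route_b):
    """True if two packets need the same switch output in some stage."""
    return any(a == b for a, b in zip(route_a[1:], route_b[1:]))


if __name__ == "__main__":
    k = 3  # 8 x 8 network
    print(omega_route(5, 2, k))
    # Inputs 0 and 4 meet in the same first-stage switch and both need its upper output.
    print(conflicts(omega_route(0, 0, k), omega_route(4, 1, k)))
```

Because every stage fixes one destination bit, each source/destination pair has exactly one path of length k. Whether two packets conflict depends on the permutation being routed.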

  31. Multistage Fat Tree • A multistage fat tree (CM-5) avoids congestion at the root node • Randomly assign packets to different paths on way up to spread the load • Increase degree near root, decrease congestion

  32. What did architects design in the 90s? Old (off-chip) MP Networks (Link and Bis. BW in MBytes/s)

      Name        Number     Topology   Bits   Clock     Link   Bis. BW   Year
      nCube/ten   1-1024     10-cube    1      10 MHz    1.2    640       1987
      iPSC/2      16-128     7-cube     1      16 MHz    2      345       1988
      MP-1216     32-512     2D grid    1      25 MHz    3      1,300     1989
      Delta       540        2D grid    16     40 MHz    40     640       1991
      CM-5        32-2048    fat tree   4      40 MHz    20     10,240    1991
      CS-2        32-1024    fat tree   8      70 MHz    50     50,000    1992
      Paragon     4-1024     2D grid    16     100 MHz   200    6,400     1992
      T3D         16-1024    3D Torus   16     150 MHz   300    19,200    1993

      No standard topology! However, for on-chip: mesh and torus are in favor!

  33. Memory hierarchy • Number of memory levels: 1, 2, 3, 4 • HW vs. SW controlled level 1 • Cache or Scratchpad memory L1 • Central vs. distributed memory • Shared vs. distributed memory address space • Intelligent DMA support: Communication Assist • For shared memory: • coherency • consistency • synchronization

  34. Intermezzo: What’s the problem with memory? Processor-Memory Performance Gap (grows 50% / year). [Figure, after Patterson: performance (log scale, 1 to 1000) versus time (1980 to 2000); CPU (“Moore’s Law”): 55%/year; DRAM: 7%/year] Memories can also be big power consumers!
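
The yearly growth of the gap follows from the two improvement rates quoted in the figure. This is a small sketch of that arithmetic, assuming both rates compound yearly; with 55% and 7% the ratio comes out at roughly 45% per year, in the same range as the rounded 50% in the slide title.

```python
cpu_growth = 1.55    # processor performance improves 55% per year (from the slide)
dram_growth = 1.07   # DRAM performance improves 7% per year (from the slide)

# Each year the processor pulls ahead of DRAM by this factor.
gap_growth = cpu_growth / dram_growth


def gap_after(years):
    """Relative processor/DRAM performance gap after `years`, starting from 1."""
    return gap_growth ** years


print(f"gap grows by {100 * (gap_growth - 1):.1f}% per year")
for years in (5, 10, 20):
    print(f"after {years:2d} years: {gap_after(years):.0f}x")
```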

  35. Multiple levels of memory. Architecture concept. [Figure: hierarchy with Level 0, Level 1, …, Level N, each with a communication network; labels: CPUs, Accelerators, Reconfigurable HW blocks, Memory, I/O]

  36. Communication models: Shared Memory. [Figure: Process P1 and Process P2 each read and write a Shared Memory] • Coherence problem • Memory consistency issue • Synchronization problem

  37. Communication models: Shared memory • Shared address space • Communication primitives: • load, store, atomic swap Two varieties: • Physically shared => Symmetric Multi-Processors (SMP) • usually combined with local caching • Physically distributed => Distributed Shared Memory (DSM)

  38. SMP: Symmetric Multi-Processor • Memory: centralized with uniform access time (UMA) and bus interconnect, I/O • Examples: Sun Enterprise 6000, SGI Challenge, Intel [Figure: four processors, each with one or more cache levels, connected to Main memory and the I/O System]

  39. DSM: Distributed Shared Memory • Nonuniform access time (NUMA) and scalable interconnect (distributed memory) [Figure: four nodes, each with Processor, Cache and Memory, connected by an Interconnection Network; further labels: Main memory, I/O System]

  40. Shared Address Model Summary • Each processor can name every physical location in the machine • Each process can name all data it shares with other processes • Data transfer via load and store • Data size: byte, word, ... or cache blocks • Memory hierarchy model applies: • communication moves data to local proc. cache

  41. Communication models: Message Passing • Communication primitives • e.g., send, receive library calls • Note that MP can be built on top of SM and vice versa [Figure: Process P1 and Process P2 exchange data with send and receive through a FIFO]

  42. Message Passing Model • Explicit message send and receive operations • Send specifies local buffer + receiving process on remote computer • Receive specifies sending process on remote computer + local buffer to place data • Typically blocking communication, but may use DMA • Message structure: Header, Data, Trailer
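
The send/receive model can be sketched with two threads and a bounded FIFO. This is a toy illustration only: `Message`, `Channel`, the checksum trailer and the header fields are names and choices assumed for the example, and Python's `queue.Queue` stands in for the network and its buffering.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Message:
    header: dict   # assumed contents: sender and receiver names
    data: bytes    # payload copied from the sender's local buffer
    trailer: int   # assumed contents: a simple checksum over the payload


def checksum(data):
    """Toy checksum used as the message trailer (an assumption of this sketch)."""
    return sum(data) % 256


class Channel:
    """Bounded FIFO between two processes with blocking send and receive."""

    def __init__(self, depth=4):
        self._fifo = queue.Queue(maxsize=depth)

    def send(self, sender, receiver, data):
        # Blocks while the FIFO is full.
        header = {"src": sender, "dst": receiver}
        self._fifo.put(Message(header, bytes(data), checksum(data)))

    def receive(self):
        # Blocks while the FIFO is empty.
        msg = self._fifo.get()
        if checksum(msg.data) != msg.trailer:
            raise ValueError("corrupted message")
        return msg


def run_demo(n_messages=5):
    chan = Channel(depth=2)
    received = []

    def producer():  # plays the role of process P1
        for i in range(n_messages):
            chan.send("P1", "P2", bytes([i, i + 1]))

    def consumer():  # plays the role of process P2
        for _ in range(n_messages):
            received.append(chan.receive().data)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(run_demo())
```

The FIFO depth bounds how far the sender can run ahead of the receiver, which is where the implicit synchronization of message passing comes from.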

  43. Message passing communication. [Figure: four nodes, each with Processor, Cache, Memory, DMA and Network interface, connected by an Interconnection Network]

  44. Communication Models: Comparison • Shared-Memory • Compatibility with well-understood (language) mechanisms • Ease of programming for complex or dynamic communications patterns • Shared-memory applications; sharing of large data structures • Efficient for small items • Supports hardware caching • Message Passing • Simpler hardware • Explicit communication • Improved synchronization

  45. Challenges of parallel processing Q1: Can we get linear speedup? Suppose we want speedup 80 with 100 processors. What fraction of the original computation can be sequential (i.e. non-parallel)? Q2: How important is communication latency? Suppose 0.2 % of all accesses are remote, and require 100 cycles on a processor with base CPI = 0.5. What’s the communication impact?
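
Both questions can be worked out with a few lines of arithmetic. The sketch below uses Amdahl's law for Q1. For Q2 it assumes that the 0.2% applies per instruction, so each instruction pays on average 0.002 x 100 extra cycles; that interpretation and the function names are assumptions of this example.

```python
def amdahl_speedup(seq_fraction, processors):
    """Amdahl's law: speedup when seq_fraction of the work cannot be parallelized."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / processors)


def max_sequential_fraction(target_speedup, processors):
    """Solve target = 1 / (s + (1 - s) / p) for the sequential fraction s."""
    return (1.0 / target_speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


def effective_cpi(base_cpi, remote_fraction, remote_cycles):
    """Average cycles per instruction including stalls on remote accesses."""
    return base_cpi + remote_fraction * remote_cycles


# Q1: speedup 80 on 100 processors.
s = max_sequential_fraction(80, 100)
print(f"Q1: at most {100 * s:.3f}% of the computation may be sequential")

# Q2: 0.2% remote accesses of 100 cycles each, base CPI 0.5.
cpi = effective_cpi(0.5, 0.002, 100)
slowdown = cpi / 0.5
print(f"Q2: CPI rises from 0.5 to {cpi:.2f}, a slowdown of {slowdown:.1f}x")
```

Under these assumptions only about a quarter of a percent of the work may be sequential, and a small share of remote accesses already makes the processor 1.4 times slower.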

  46. Three fundamental issues for shared memory multiprocessors • Coherence, about: Do I see the most recent data? • Consistency, about: When do I see a written value? • e.g. do different processors see writes at the same time (w.r.t. other memory accesses)? • Synchronization, about: How to synchronize processes? • how to protect access to shared data?
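
One common way to protect shared data is a spin lock built on an atomic swap, the primitive listed earlier for shared memory. The sketch below only emulates this in Python: `AtomicCell` is a hypothetical stand-in for a memory word, and its internal `threading.Lock` plays the role of the hardware that makes the swap indivisible.

```python
import threading
import time


class AtomicCell:
    """One shared memory word with load-free store and atomic swap.

    The private lock emulates hardware atomicity; it is not part of the algorithm.
    """

    def __init__(self, value=0):
        self._value = value
        self._hw = threading.Lock()

    def swap(self, new):
        """Atomically write `new` and return the previous value."""
        with self._hw:
            old, self._value = self._value, new
            return old

    def store(self, new):
        with self._hw:
            self._value = new


class SpinLock:
    """Test-and-set style lock: 0 means free, 1 means taken."""

    def __init__(self):
        self._flag = AtomicCell(0)

    def acquire(self):
        # Keep swapping in 1 until the old value was 0, i.e. we took a free lock.
        while self._flag.swap(1) == 1:
            time.sleep(0)  # yield to other threads while spinning

    def release(self):
        self._flag.store(0)


def count_with_lock(n_threads=4, increments=500):
    """Several threads increment a shared counter inside the critical section."""
    lock = SpinLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            counter[0] += 1  # protected read-modify-write on shared data
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


if __name__ == "__main__":
    print(count_with_lock())
```

On real hardware the spinning processor would repeatedly access the lock variable, which is why lock implementations interact closely with the coherence mechanism.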

  47. Coherence problem, in single CPU system. [Figure: three panels, each with CPU, cache, memory and I/O. Consistent state: cache a' = 100, b' = 200; memory a = 100, b = 200. Cache updated: cache a' = 550, b' = 200; memory a = 100, b = 200. Memory updated: cache a' = 100, b' = 200; memory a = 100, b = 440.]

  48. Coherence problem, in Multi-Proc system. [Figure: CPU-1 cache holds a' = 550, b' = 200; CPU-2 cache holds a'' = 100, b'' = 200; memory holds a = 100, b = 200]
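
The situation in this figure can be reproduced with a toy model of two private write-back caches. The sketch below uses the slide's values (a = 100 in memory, CPU-1 writes 550). The `Cache` class is hypothetical, and the `coherent=True` branch (write the value through and invalidate the other copy) is just one possible remedy assumed here for contrast; it is not taken from this part of the presentation.

```python
class Cache:
    """Private cache in front of a shared memory, modelled as a dictionary."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        # A miss fetches the value from memory; a hit returns the cached copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Write-back behaviour: only the local copy changes.
        self.lines[addr] = value

    def flush(self, addr):
        # Copy the local value back to memory.
        if addr in self.lines:
            self.memory[addr] = self.lines[addr]

    def invalidate(self, addr):
        # Drop the local copy so the next read misses.
        self.lines.pop(addr, None)


def scenario(coherent):
    """Value of `a` seen by CPU-2 after CPU-1 has written 550."""
    memory = {"a": 100, "b": 200}
    cache1, cache2 = Cache(memory), Cache(memory)

    cache1.read("a")          # both CPUs load a = 100 into their caches
    cache2.read("a")
    cache1.write("a", 550)    # CPU-1 updates its copy: a' = 550

    if coherent:
        # Assumed remedy for illustration: write through and invalidate other copies.
        cache1.flush("a")
        cache2.invalidate("a")

    return cache2.read("a")


if __name__ == "__main__":
    print("without coherence, CPU-2 reads", scenario(coherent=False))
    print("with invalidation, CPU-2 reads", scenario(coherent=True))
```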

  49. What Does Coherency Mean? • Informally: • “Any read must return the most recent write” • Too strict and too difficult to implement • Better: • “Any write must eventually be seen by a read” • All writes are seen in proper order (“serialization”)

  50. Two rules to ensure coherency • “If P writes x and P1 reads it, P’s write will be seen by P1 if the read and write are sufficiently far apart” • Writes to a single location are serialized: seen in one order • Latest write will be seen • Otherwise could see writes in illogical order (could see older value after a newer value)
