Explore the concept of platform-based design for MPSoCs, addressing memory coherence, message passing, and design choices for shared and parallel systems. Discover how platforms enhance design efficiency, optimize performance, and reduce time-to-market, featuring tools like RTOS and libraries. Learn about platform characteristics, parallel architectures, power efficiency through parallelism, and the distinction between homogeneous and heterogeneous systems. Gain insights into the benefits and challenges of parallel processing in modern system design.
Platform Design: Multi-Processor Systems-on-Chip (MPSoC), TU/e 5kk70, Henk Corporaal, Bart Mesman
Overview • What is a platform, and why platform-based design? • Why parallel platforms? • A first classification of parallel systems • Design choices for parallel systems • Shared memory systems • Memory Coherency, Consistency, Synchronization, Mutual exclusion • Message passing systems • Further decisions
Design & Product requirements? • Short time-to-market • Reuse / standards • Short design time • Flexible solution • Reduces design time • Extends product lifetime; remote inspection and debug, … • Scalability • High performance and low power • Memory bottleneck, wiring bottleneck • Low cost • High quality, reliability, dependability • RTOS and libraries • Good programming environment
Solution? • Platforms • Programmable • One or more processor cores • Reconfigurable • Scalable and flexible • Memory hierarchy • Exploit locality • Separate local and global wiring • HW and SW IP reuse • Standardization (of SW and HW interfaces) • Raising the design abstraction level • Reliable • Cheaper • Advanced design flow for platforms
What is a platform? Definition: a platform is a generic, but domain-specific, information processing (sub-)system. Generic means that it is flexible, containing programmable component(s). Platforms are meant to quickly realize your next system (in a certain domain). Single chip?
Example Platform: Sanyo Camera
Platform example: TI OMAP. On-chip memories (from the block diagram): 192 KB shared SRAM; 8 KB data cache (2-way, 512 lines of 16 bytes) with a 17-element write buffer; two 16 KB (2-way) caches; 8 KB memory (2x 4K); 64 KB dual-port RAM (8x 4K x 16b); 96 KB single-port RAM (12x 4K x 16b); 32 KB ROM. Up to 192 MB off-chip memory.
Platform and platform design: a layered view. Applications are mapped onto the platform using system design technology (SDT); the platform itself is built on the enabling technologies using platform design technology (PDT).
Why parallel processing? • Performance drive • Diminishing returns for exploiting ILP and OLP • Multiple processors fit easily on a chip • Cost effective (just connect existing processors or processor cores) • Low power: parallelism may allow lowering Vdd However: • Parallel programming is hard
Low power through parallelism • Sequential processor • Switching capacitance C • Frequency f • Voltage V • P = f·C·V² • Parallel processor (two times the number of units) • Switching capacitance 2C • Frequency f/2 • Voltage V' < V • P = (f/2)·2C·V'² = f·C·V'²
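A short worked version of this argument; the voltage scaling factor of 0.7 below is an assumed value for illustration, not taken from the slides:

```latex
P_{\mathrm{seq}} = f\,C\,V^2, \qquad
P_{\mathrm{par}} = \frac{f}{2}\cdot 2C\cdot V'^2 = f\,C\,V'^2
\quad\Rightarrow\quad
\frac{P_{\mathrm{par}}}{P_{\mathrm{seq}}} = \left(\frac{V'}{V}\right)^2
\;\approx\; 0.5 \quad \text{for } V' = 0.7\,V
```

The two half-speed units deliver the same throughput as the single full-speed processor, but at roughly half the power once the lower clock frequency allows the supply voltage to be reduced.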
Power efficiency: compare 2 examples • Intel Pentium-4 (Northwood) in 0.13 micron technology • 3.0 GHz • 20 pipeline stages • Aggressive buffering to boost clock frequency • 13 nanojoule / instruction • Philips TriMedia “Lite” in 0.13 micron technology • 250 MHz • 8 pipeline stages • Relaxed buffering, focus on instruction parallelism • 0.2 nanojoule / instruction • TriMedia uses about 65x less energy per instruction than the Pentium
Parallel Architecture • A parallel architecture extends traditional computer architecture with a communication network • abstractions (HW/SW interface) • organizational structure to realize the abstraction efficiently • Structure: multiple processing nodes connected by a communication network
Platform characteristics • System level • Processor level • Communication network • Memory system • Tooling
System level characteristics • Homogeneous vs. heterogeneous • Granularity of processing elements • Type of supported parallelism: TLP, DLP • Runtime mapping support?
Homogeneous or Heterogeneous • Homogeneous: • replication effect • memory dominated anyway • solve realization issues once and for all • less flexible • Typically: • data level parallelism • shared memory • dynamic task mapping
Example: Philips Wasabi • Homogeneous multiprocessor for media applications • Two-level communication hierarchy • Top: scalable message passing network plus tiles • Tile: shared memory plus processors (ARM, multiple TriMedia (TM) cores) and accelerators (pixel SIMD, video scale, picture improve) • Fully cache coherent to support data parallelism
Homogeneous or Heterogeneous • Heterogeneous • better fit to application domain • smaller increments • Typically: • task level parallelism • message passing • static task mapping
Example: Viper2 • Heterogeneous • Platform based • >60 different cores (blocks include MIPS PR4450, two TM3260 TriMedia cores, MBS, VMPG, TDCS, VIP, MSP, MDCS, QVCP5L, QVCP2L) • Task parallelism • Sync with interrupts • Streaming communication • Semi-static application graph • 50 M transistors • 120 nm technology • Powerful, efficient
Homogeneous or Heterogeneous • Middle-of-the-road approach • Flexible tiles • Fixed tile structure at top level
Types of parallelism • TLP (program/thread level): heterogeneous; multi-threaded / MIMD • DLP (module level): homogeneous; SIMD / vector • ILP (kernel level): heterogeneous; VLIW / superscalar / dataflow architectures
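As a quick illustration of where these levels show up in code, here is a minimal C sketch; the function names and the image-scaling kernel are made up for this example and are not from the slides:

```c
#include <stddef.h>
#include <stdio.h>

/* DLP: every iteration applies the same operation to different data,
 *      so a SIMD/vector unit can process several elements at once.
 * ILP: the independent load, multiply and store inside one iteration
 *      can be scheduled in parallel on a VLIW/superscalar core.      */
static void scale_image(float *img, size_t n, float gain)
{
    for (size_t i = 0; i < n; i++)
        img[i] *= gain;
}

/* TLP: each image is an independent task; the iterations of this loop
 *      could be mapped onto different processors or threads.         */
static void scale_all(float **imgs, size_t k, size_t n, float gain)
{
    for (size_t j = 0; j < k; j++)
        scale_image(imgs[j], n, gain);
}

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    float *imgs[2] = {a, b};
    scale_all(imgs, 2, 4, 0.5f);
    printf("%.1f %.1f\n", a[0], b[3]);   /* prints 0.5 4.0 */
    return 0;
}
```

The inner loop is where a SIMD or VLIW machine finds DLP and ILP; the independent images are the natural unit of TLP on a multi-processor.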
Processor level characteristics. A processor consists of • Instruction engine (control processor, I-fetch unit) • Processing element (PE): register file, function unit(s), L1 DMem Design choices: • Single PE vs. multiple PEs (as in SIMD) • Single FU/PE vs. multiple FUs/PE (as in VLIW) • Granularity of PEs, FUs • Specialized vs. generic • Interruptible, pre-emption support • Multithreading support (fast context switches) • Clustering of PEs; clustering of FUs • Type of inter-PE and inter-FU communication network • Others: MMU / virtual memory, …
Generic or Specialized? Intrinsic computational efficiency
General processor organization. [Figure: classic five-stage pipelined datapath with IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers, PC, instruction memory, register file, ALU, sign extend and data memory.] • Instruction fetch / control • PE: processing engine • FU: function unit
(Linear) SIMD Architecture • A control processor with instruction memory (IMem) drives an array of PEs (PE1 … PEn), each containing a function unit (FU), a register file (RF) and a local data memory (DMem) • To be added: • inter-PE communication • communication from PEs to the control processor • input and output
Communication network • Bus (single all2all connection) vs. crossbar vs. NoC with point-to-point connections • Topology, router degree • Routing • path, path control, collision resolution, network support, deadlock handling, livelock handling • virtual layer support • flow control and buffering • error handling • Inter-chip network support • Guarantees • TDMA • GT vs. BE traffic • etc., etc.
Comm. network: performance metrics • Network bandwidth • Need high bandwidth in communication • How does it scale with the number of nodes? • Communication latency • Affects performance, since the processor may have to wait • Affects ease of programming, since it requires more thought to overlap communication and computation • Latency hiding • Global memory access can take hundreds of cycles • How can a mechanism help hide latency? • Examples: • overlap message send with computation • prefetch data • switch to other tasks
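To make the "overlap message send with computation" example concrete, here is a hedged C sketch using standard non-blocking MPI calls; MPI is only a familiar stand-in here, not the communication API of the platforms in these slides, and it assumes MPI_Init has already been called:

```c
#include <mpi.h>

/* Latency hiding: start the send, do independent computation, then
 * wait for completion, so the transfer overlaps with useful work.   */
void send_and_compute(double *msg, int n, int dest,
                      double *local, int m)
{
    MPI_Request req;
    MPI_Isend(msg, n, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD, &req);

    for (int i = 0; i < m; i++)        /* work that does not touch msg */
        local[i] = local[i] * 2.0;

    MPI_Wait(&req, MPI_STATUS_IGNORE); /* msg may only be reused after this */
}
```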
How good is your network? The topology determines: • Degree = number of links from a node • Diameter = maximum number of links crossed between any two nodes • Average distance = average number of links to a random destination • Bisection = minimum number of links that separate the network into two halves • Bisection bandwidth = link bandwidth x bisection
Metrics for common topologies (N = number of nodes, n = dimension)
Type       | Degree | Diameter        | Avg. distance  | Bisection
1D mesh    | 2      | N-1             | N/3            | 1
2D mesh    | 4      | 2(N^(1/2) - 1)  | 2 N^(1/2) / 3  | N^(1/2)
3D mesh    | 6      | 3(N^(1/3) - 1)  | 3 N^(1/3) / 3  | N^(2/3)
nD mesh    | 2n     | n(N^(1/n) - 1)  | n N^(1/n) / 3  | N^((n-1)/n)
Ring       | 2      | N/2             | N/4            | 2
2D torus   | 4      | N^(1/2)         | N^(1/2) / 2    | 2 N^(1/2)
Hypercube  | log2 N | n = log2 N      | n/2            | N/2
2D tree    | 3      | 2 log2 N        | ~2 log2 N      | 1
Crossbar   | N-1    | 1               | 1              | N^2 / 2
More topology metrics: compare a hypercube, a 2D grid/mesh and a 2D torus, assuming 64 nodes.
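A small C sketch that plugs N = 64 into the formulas from the table above for the three topologies being compared (link with -lm):

```c
#include <math.h>
#include <stdio.h>

/* Evaluate the topology metrics of the previous table for N = 64. */
int main(void)
{
    double N = 64.0, s = sqrt(N), n = log2(N);

    /* 2D mesh:  degree 4, diameter 2(sqrt(N)-1), avg 2*sqrt(N)/3, bisection sqrt(N) */
    printf("2D mesh:   deg 4, diam %.0f, avg %.1f, bisection %.0f\n",
           2 * (s - 1), 2 * s / 3, s);

    /* 2D torus: degree 4, diameter sqrt(N), avg sqrt(N)/2, bisection 2*sqrt(N)      */
    printf("2D torus:  deg 4, diam %.0f, avg %.1f, bisection %.0f\n",
           s, s / 2, 2 * s);

    /* Hypercube: degree log2(N), diameter log2(N), avg log2(N)/2, bisection N/2     */
    printf("Hypercube: deg %.0f, diam %.0f, avg %.1f, bisection %.0f\n",
           n, n, n / 2, N / 2);
    return 0;
}
```

For 64 nodes this gives a diameter of 14 for the 8x8 mesh, 8 for the torus and 6 for the hypercube, with the hypercube paying for its short distances with a higher router degree and a larger bisection to wire.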
How to make a bigger butterfly network? Combine two N/2 butterflies with one extra switch stage. Multi-stage network: butterfly or Omega • All paths have equal length • Unique path from any input to any output • Try to avoid conflicts! [Figure: 8 x 8 butterfly switch]
Multistage fat tree • A multistage fat tree (as in the CM-5) avoids congestion at the root node • Randomly assign packets to different paths on the way up to spread the load • Increase the degree near the root to decrease congestion
What did architects design in the '90s? Old (off-chip) MP networks (link and bisection bandwidth in MBytes/s)
Name      | Number  | Topology | Bits | Clock   | Link | Bis. BW | Year
nCube/ten | 1-1024  | 10-cube  | 1    | 10 MHz  | 1.2  | 640     | 1987
iPSC/2    | 16-128  | 7-cube   | 1    | 16 MHz  | 2    | 345     | 1988
MP-1216   | 32-512  | 2D grid  | 1    | 25 MHz  | 3    | 1,300   | 1989
Delta     | 540     | 2D grid  | 16   | 40 MHz  | 40   | 640     | 1991
CM-5      | 32-2048 | fat tree | 4    | 40 MHz  | 20   | 10,240  | 1991
CS-2      | 32-1024 | fat tree | 8    | 70 MHz  | 50   | 50,000  | 1992
Paragon   | 4-1024  | 2D grid  | 16   | 100 MHz | 200  | 6,400   | 1992
T3D       | 16-1024 | 3D torus | 16   | 150 MHz | 300  | 19,200  | 1993
No standard topology! However, for on-chip networks, mesh and torus are in favor!
Memory hierarchy • Number of memory levels: 1, 2, 3, 4 • HW- vs. SW-controlled level 1 • Cache vs. scratchpad memory at L1 • Central vs. distributed memory • Shared vs. distributed memory address space • Intelligent DMA support: communication assist • For shared memory: • coherency • consistency • synchronization
Intermezzo: what's the problem with memory? The processor-memory performance gap grows ~50% per year: µProc performance improves ~55%/year ("Moore's Law"), DRAM only ~7%/year [Patterson]. Memories can also be big power consumers!
Multiple levels of memory. Architecture concept: at level 0, CPUs, accelerators and reconfigurable HW blocks share a local communication network with memory and I/O; these tiles are in turn connected through level-1 communication networks with their own memory and I/O, and so on up to level N.
Communication models: Shared Memory • Processes P1 and P2 communicate by reading and writing a shared memory • Coherence problem • Memory consistency issue • Synchronization problem
Communication models: Shared memory • Shared address space • Communication primitives: • load, store, atomic swap Two varieties: • Physically shared => Symmetric Multi-Processors (SMP) • usually combined with local caching • Physically distributed => Distributed Shared Memory (DSM)
SMP: Symmetric Multi-Processor • Several processors, each with one or more cache levels, connected by a bus to a centralized main memory and I/O system • Memory: centralized with uniform access time (UMA) and bus interconnect, I/O • Examples: Sun Enterprise 6000, SGI Challenge, Intel
DSM: Distributed Shared Memory • Each processor has its own cache and a local memory; the nodes (plus I/O) are connected by a scalable interconnection network • Non-uniform access time (NUMA) and scalable interconnect (distributed memory)
Shared Address Model Summary • Each processor can name every physical location in the machine • Each process can name all data it shares with other processes • Data transfer via load and store • Data size: byte, word, ... or cache blocks • The memory hierarchy model applies: • communication moves data into the local processor cache
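A minimal sketch of this model in C11 on a cache-coherent shared-memory machine; the producer/consumer naming and the flag protocol are illustrative, not from the slides. Data is communicated with plain loads and stores, and the atomic swap mentioned above acts as the synchronization primitive:

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int shared_data;          /* communicated by ordinary load/store */
static atomic_int flag = 0;      /* signalled with an atomic exchange   */

static void *producer(void *arg)
{
    (void)arg;
    shared_data = 42;            /* plain store into the shared space   */
    atomic_exchange(&flag, 1);   /* atomic swap: mark the data as ready */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    while (atomic_load(&flag) == 0)   /* spin until the producer is done */
        ;
    printf("got %d\n", shared_data);  /* plain load sees the new value   */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```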
Communication models: Message Passing • Processes P1 and P2 communicate through send and receive operations over FIFO channels • Communication primitives • e.g., send, receive library calls • Note that MP can be built on top of SM and vice versa
Message Passing Model • Explicit message send and receive operations • Send specifies local buffer + receiving process on remote computer • Receive specifies sending process on remote computer + local buffer to place data • Typically blocking communication, but may use DMA • Message structure: header, data, trailer
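For comparison with the shared-memory sketch above, a minimal blocking message-passing example in C using the standard MPI send/receive calls; MPI is again only a stand-in API, and the rank numbers and buffer contents are illustrative:

```c
#include <mpi.h>
#include <stdio.h>

/* Blocking message passing: the send names a local buffer and the
 * receiving process (rank); the receive names the sending process
 * and the local buffer where the data must be placed.              */
int main(int argc, char **argv)
{
    int rank, buf[4] = {1, 2, 3, 4};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)            /* P1: send local buffer to process 1  */
        MPI_Send(buf, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {     /* P2: receive from process 0 into buf */
        MPI_Recv(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d %d %d %d\n", buf[0], buf[1], buf[2], buf[3]);
    }
    MPI_Finalize();
    return 0;
}
```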
Message passing communication • Each node contains a processor, cache and memory plus a network interface with DMA; the nodes are connected by an interconnection network
Communication Models: Comparison • Shared memory • Compatibility with well-understood (language) mechanisms • Ease of programming for complex or dynamic communication patterns • Shared-memory applications; sharing of large data structures • Efficient for small items • Supports hardware caching • Message passing • Simpler hardware • Explicit communication • Improved synchronization
Challenges of parallel processing • Q1: can we get linear speedup? Suppose we want a speedup of 80 with 100 processors. What fraction of the original computation can be sequential (i.e. non-parallel)? • Q2: how important is communication latency? Suppose 0.2% of all accesses are remote and require 100 cycles, on a processor with base CPI = 0.5. What is the communication impact?
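A worked sketch of both questions; Q2 assumes one memory access per instruction, which the slide does not state:

```latex
% Q1 (Amdahl's law): solve 80 = 1 / ((1 - f) + f/100) for the parallel fraction f
(1 - f) + \frac{f}{100} = \frac{1}{80} = 0.0125
\;\Rightarrow\; f \approx 0.9975
\;\Rightarrow\; \text{at most } \approx 0.25\% \text{ of the work may be sequential}

% Q2: effective CPI with 0.2% remote accesses of 100 cycles each
\mathrm{CPI}_{\mathrm{eff}} = 0.5 + 0.002 \times 100 = 0.7
\;\Rightarrow\; 0.7 / 0.5 = 1.4\times \text{ slower due to remote accesses}
```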
Three fundamental issues for shared memory multiprocessors • Coherence, about: do I see the most recent data? • Consistency, about: when do I see a written value? • e.g. do different processors see writes at the same time (w.r.t. other memory accesses)? • Synchronization: how to synchronize processes? • how to protect access to shared data?
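A minimal sketch of the synchronization/mutual-exclusion point in C with POSIX threads; the shared counter is illustrative, and on an MPSoC the lock would typically be built on a hardware atomic operation or a semaphore peripheral rather than pthreads:

```c
#include <pthread.h>
#include <stdio.h>

/* Mutual exclusion: only one thread at a time may update the shared
 * counter, otherwise concurrent increments can be lost.             */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* enter critical section  */
        counter++;                     /* protected shared access */
        pthread_mutex_unlock(&lock);   /* leave critical section  */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("%ld\n", counter);   /* 400000, only because of the mutex */
    return 0;
}
```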
Coherence problem, in a single-CPU system. [Figure: initially the cache holds a' = 100, b' = 200 and memory holds a = 100, b = 200. After the CPU writes a (cache: a' = 550) the memory still holds a = 100; after an I/O device writes b in memory (b = 440) the cache still holds b' = 200. Cache and memory disagree.]
Coherence problem, in a multi-processor system. [Figure: CPU-1's cache holds a' = 550, b' = 200 while CPU-2's cache holds a'' = 100, b'' = 200 and memory holds a = 100, b = 200: after CPU-1 writes a, CPU-2 still sees the stale value.]
What Does Coherency Mean? • Informally: • “Any read must return the most recent write” • Too strict and too difficult to implement • Better: • “Any write must eventually be seen by a read” • All writes are seen in proper order (“serialization”)
Two rules to ensure coherency • “If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart in time” • Writes to a single location are serialized: seen in one order • The latest write will be seen • Otherwise writes could be seen in an illogical order (an older value observed after a newer one)