
The Kill Rule for Multicore


Presentation Transcript


1. The Kill Rule for Multicore. Anant Agarwal, MIT and Tilera Corp.

2. Multicore is Moving Fast
• Corollary of Moore’s Law: the number of cores will double every 18 months
• What must change to enable this growth?

3. Multicore Drivers Suggest Three Directions
• Diminishing returns: smaller structures
• Power efficiency: smaller structures; slower clocks, voltage scaling
• Wire delay: distributed structures
• Multicore programming
These drivers raise three questions:
1. How we size core resources
2. How we connect the cores
3. How programming will evolve

4. How We Size Core Resources
• Per-core model: IPC = 1 / (1 + m × latency), where IPC is instructions per cycle and m is the cache miss rate
[Diagram: a fixed chip area filled either with small-cache cores (core IPC = 1, area = 1) or big-cache cores (core IPC = 1.2, area = 1.3)]
• 4 cores with small caches: chip IPC = 4
• 3 cores with big caches: chip IPC = 3.6
• 3 cores with small caches: chip IPC = 3
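The sizing tradeoff above is just arithmetic over a fixed area budget. A minimal sketch in C, using the slide's core figures (area 1.0 / IPC 1.0 for the small-cache core, area 1.3 / IPC 1.2 for the big-cache core) and an assumed chip budget of 4 area units:

#include <stdio.h>

int main(void) {
    /* Sketch of slide 4's arithmetic (illustrative only). Chip IPC is the
     * number of cores that fit in an assumed area budget times the per-core
     * IPC; the per-core figures are the slide's. */
    double chip_area = 4.0;                        /* assumed budget: 4 small cores */

    double small_area = 1.0, small_ipc = 1.0;      /* small-cache core */
    double big_area   = 1.3, big_ipc   = 1.2;      /* big-cache core   */

    int n_small = (int)(chip_area / small_area);   /* 4 cores */
    int n_big   = (int)(chip_area / big_area);     /* 3 cores */

    printf("small caches: %d cores, chip IPC = %.1f\n", n_small, n_small * small_ipc);
    printf("big caches:   %d cores, chip IPC = %.1f\n", n_big,   n_big   * big_ipc);
    return 0;
}

With the area budget fixed, the bigger core wins on per-core IPC but loses on chip IPC because fewer cores fit, which is exactly the comparison the slide draws.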

5. “KILL Rule” for Multicore: Kill If Less than Linear
• A resource in a core must be increased in area only if the core’s performance improvement is at least proportional to the core’s area increase
• Put another way: increase a resource’s size only if every 1% increase in core area yields at least a 1% increase in core performance
• Leads to power-efficient multicore design
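The rule reduces to a simple predicate on relative gains. A minimal, illustrative sketch in C (not a tool from the talk), applied to slide 4's big-cache core:

#include <stdbool.h>
#include <stdio.h>

/* KILL-rule check: accept a core resource increase only if the core's
 * relative IPC gain is at least as large as its relative area increase. */
static bool kill_rule_accepts(double area_before, double ipc_before,
                              double area_after,  double ipc_after) {
    double area_gain = (area_after - area_before) / area_before;
    double ipc_gain  = (ipc_after  - ipc_before)  / ipc_before;
    return ipc_gain >= area_gain;
}

int main(void) {
    /* Slide 4's big-cache core: +30% area for +20% IPC. */
    printf("big cache accepted? %s\n",
           kill_rule_accepts(1.0, 1.0, 1.3, 1.2) ? "yes" : "no");
    return 0;
}

The +30% area, +20% IPC cache upgrade fails the test, so under the KILL rule it is rejected.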

6. KILL Rule for Cache Size, Using a Video Codec
Sweeping the per-core cache size on a fixed-area chip (area is relative to the 512B-cache core):

  Cache   Core area   Core IPC   Cores per chip   Chip IPC
  512B      1.00        0.04          100             4
  2KB       1.03        0.17           97            17
  4KB       1.07        0.25           93            23
  8KB       1.15        0.29           87            25
  16KB      1.31        0.31           76            24
  32KB      1.63        0.32           61            19

Per-step increases (core area vs. core IPC): 512B to 2KB: +2% area, +325% IPC; 2KB to 4KB: +4%, +47%; 4KB to 8KB: +7%, +16%; 8KB to 16KB: +14%, +7%; 16KB to 32KB: +24%, +3%. Chip IPC peaks at 8KB, the last step where the IPC gain still exceeds the area increase.
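Running the table's per-core figures through the same check shows where growth should stop. A sketch, using only the area and IPC columns above (percentages are recomputed from the table, so rounding may differ slightly from the slide's labels):

#include <stdio.h>

struct point { const char *cache; double area, ipc; };

int main(void) {
    /* Per-core area and IPC from the video-codec table above. */
    struct point p[] = {
        {"512B", 1.00, 0.04}, {"2KB", 1.03, 0.17}, {"4KB", 1.07, 0.25},
        {"8KB", 1.15, 0.29}, {"16KB", 1.31, 0.31}, {"32KB", 1.63, 0.32},
    };
    int n = sizeof p / sizeof p[0];
    for (int i = 1; i < n; i++) {
        double area_gain = (p[i].area - p[i-1].area) / p[i-1].area;
        double ipc_gain  = (p[i].ipc  - p[i-1].ipc)  / p[i-1].ipc;
        printf("%4s -> %-4s area +%3.0f%%, IPC +%4.0f%% : %s\n",
               p[i-1].cache, p[i].cache, 100 * area_gain, 100 * ipc_gain,
               ipc_gain >= area_gain ? "keep growing" : "KILL");
    }
    return 0;
}

The rule rejects growth beyond 8KB, which is also where the chip IPC in the table peaks.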

7. Well Beyond Diminishing Returns
[Die photo: Madison Itanium 2 cache system, with the L3 cache occupying much of the die. Photo courtesy Intel Corp.]

8. Slower Clocks Suggest Even Smaller Caches
• Insight: maintain constant instructions per cycle (IPC), where IPC = 1 / (1 + m × latency) and latency is the miss penalty in cycles
• At 4 GHz: IPC = 1 / (1 + 0.5% × 200) = 1/2
• At 1 GHz: IPC = 1 / (1 + 2.0% × 50) = 1/2
• At 1 GHz the miss penalty in cycles is 4x smaller, so the miss rate can be 4x higher for the same IPC
• Implies that the cache can be 16x smaller!
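A worked version of this arithmetic, as a C sketch. The 16x figure relies on the common rule of thumb that miss rate scales roughly as 1/sqrt(cache size), which the slide implies but does not state:

#include <stdio.h>

/* Slide 8's model: IPC = 1 / (1 + miss_rate * miss_penalty_cycles) */
static double ipc(double miss_rate, double miss_penalty_cycles) {
    return 1.0 / (1.0 + miss_rate * miss_penalty_cycles);
}

int main(void) {
    printf("4 GHz, 200-cycle penalty, 0.5%% miss rate: IPC = %.2f\n", ipc(0.005, 200));
    printf("1 GHz,  50-cycle penalty, 2.0%% miss rate: IPC = %.2f\n", ipc(0.020, 50));

    /* Assumed rule of thumb: miss rate ~ 1/sqrt(cache size), so a 4x higher
     * tolerable miss rate allows a 4^2 = 16x smaller cache. */
    double rate_ratio = 0.020 / 0.005;
    printf("miss rate can be %.0fx higher -> cache ~%.0fx smaller\n",
           rate_ratio, rate_ratio * rate_ratio);
    return 0;
}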

9. Multicore Drivers Suggest Three Directions
• Diminishing returns: smaller structures
• Power efficiency: smaller structures; slower clocks, voltage scaling
• Wire delay: distributed structures
• Multicore programming
1. How we size core resources
  • The KILL rule suggests smaller caches for multicore
  • If the clock is slower by x, then for constant IPC the cache can be smaller by x²
  • The KILL rule applies to all multicore resources
  • Issue width: 2-way is probably ideal [Simplefit, TPDS 7/2001]
  • Cache sizes and the number of memory hierarchy levels
2. How we connect the cores
3. How programming will evolve

10. Interconnect Options
[Diagram: bus multicore, ring multicore, and mesh multicore, each built from processor (p), cache (c), and switch (s) tiles]
• Packet routing through switches

11. Bisection Bandwidth is Important
[Same diagram: bus, ring, and mesh multicores]

12. Concept of Bisection Bandwidth
[Same diagram: bus, ring, and mesh multicores]
• In a mesh, bisection bandwidth increases as we add more cores (see the sketch below)
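For intuition, the textbook bisection-bandwidth scaling of the three topologies can be tabulated. A sketch assuming one unit of bandwidth per link (numbers not from the talk): a bus bisects to one shared medium, a ring to two links, and a 2-D mesh to roughly sqrt(N) links:

#include <math.h>
#include <stdio.h>

int main(void) {
    /* Bisection bandwidth in link-widths, assuming one unit per link:
     * bus = 1 (shared medium), ring = 2 links cross the cut,
     * 2-D mesh = sqrt(N) links cross the cut. */
    printf("%6s %5s %5s %5s\n", "cores", "bus", "ring", "mesh");
    for (int n = 4; n <= 1024; n *= 4)
        printf("%6d %5d %5d %5.0f\n", n, 1, 2, sqrt((double)n));
    return 0;
}

Only the mesh's bisection bandwidth grows with the number of cores, which is the slide's point.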

13. Meshes are Power Efficient
[Chart: % energy savings of mesh vs. bus across benchmarks, as a function of the number of processors]

14. Meshes Offer Simple Layout
Example: MIT’s Raw Multicore
• 16 cores
• Demonstrated in 2002
• 0.18 micron
• 425 MHz
• IBM SA27E standard cell
• 6.8 GOPS
• www.cag.csail.mit.edu/raw

15. Fully Distributed, No Centralized Resources
[Diagram: a bus multicore sharing an L2 cache vs. a tiled multicore in which every tile pairs a processor and cache with a switch]
• Multicore: a single chip with multiple processing units and multiple, independent threads of control, or program counters – MIMD
• A tiled multicore satisfies one additional property: it is fully distributed, with no centralized resources

16. Multicore Drivers Suggest Three Directions
• Diminishing returns: smaller structures
• Power efficiency: smaller structures; slower clocks, voltage scaling
• Wire delay: distributed structures
• Multicore programming
1. How we size core resources
2. How we connect the cores: a mesh-based tiled multicore
3. How programming will evolve

17. Multicore Programming Challenge
• “Multicore programming is hard.” Why?
  • It is new
  • It is misunderstood: some sequential programs are harder
  • Current tools are where VLSI design tools were in the mid-80s
  • Standards are needed (tools, ecosystems)
• This problem will be solved soon. Why?
  • Multicore is here to stay
  • Intel webinar: “Think parallel or perish”
  • There is an opportunity to create the API foundations
  • The incentives are there

18. Old Approaches Fall Short
• Pthreads
  • The Intel webinar likens it to the assembly language of parallel programming
  • Data races are hard to analyze
  • No encapsulation or modularity
  • But it is evolutionary, and OK in the interim
• DMA with external shared memory
  • DSP programmers favor DMA
  • Explicit copying from global shared memory to the local store
  • Wastes pin bandwidth and energy
  • But it is evolutionary, simple, and modular, with a small core memory footprint
• MPI
  • The province of HPC users
  • Based on sending explicit messages between private memories
  • High overheads and a large core memory footprint
• But there is a big new idea staring us in the face

19. Inspiration from ASICs: Streaming
[Diagram: processing blocks with local memories, connected by hardware FIFOs; a stream of data flows over each FIFO]
• Streaming is energy efficient and fast
• The concept is familiar and well developed in hardware design and simulation languages

20. Streaming is Familiar – Like Sockets
[Diagram: a sender process and a receiver process connected by an interconnect channel between Port1 and Port2]
• The basis of networking and internet software
• Familiar and popular
• Modular and scalable
• Conceptually simple
• Each process can use existing sequential code

21. Core-to-Core Data Transfer is Cheaper than Memory Access (data based on a 90nm process node)
• Energy
  • 32b network transfer over a 1mm channel: 3 pJ
  • 32KB cache read: 50 pJ
  • External access: 200 pJ
• Latency
  • Register to register: 5 cycles (RAW)
  • Cache to cache: 50 cycles
  • DRAM access: 200 cycles
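A back-of-envelope comparison using these numbers, as a C sketch. The hop distance and the round trip through external memory are assumptions added here, and the per-word costs are simply scaled linearly:

#include <stdio.h>

int main(void) {
    /* Slide's 90nm energy figures: 3 pJ per 32-bit word per mm on the
     * on-chip network, 200 pJ per external (off-chip) access. Assumptions:
     * a 5 mm core-to-core path, and a write-out plus read-back for the
     * external-memory route. */
    const double words  = 1e6;    /* 1M 32-bit words, ~4 MB stream */
    const double hop_mm = 5.0;    /* assumed on-chip distance */

    double network_pj  = words * 3.0 * hop_mm;
    double external_pj = words * 200.0 * 2.0;

    printf("stream over on-chip network: %.1f uJ\n", network_pj / 1e6);
    printf("round trip through DRAM:     %.1f uJ\n", external_pj / 1e6);
    printf("network is ~%.0fx cheaper\n", external_pj / network_pj);
    return 0;
}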

22. Streaming Supports Many Models
• Pipeline
• Client-server
• Broadcast-reduce
• Not great for blackboard-style shared state, but then, there is no one size fits all

23. Multicore Streaming Can be Way Faster than Sockets
[Diagram: a sender process and a receiver process connected by an interconnect channel:
connect(<send_proc, Port1>, <receive_proc, Port2>); the sender repeatedly calls Put(Port1, Data) and the receiver repeatedly calls Get(Port2, Data)]
• No fundamental overheads for
  • Unreliable communication
  • High-latency buffering
  • Hardware heterogeneity
  • OS heterogeneity
• Infrequent setup
• Common-case operations are fast and power efficient
• Low memory footprint
• The basis of the MCA’s CAPI standard (a sketch of the pattern follows below)
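To make the Put/Get pattern concrete, here is a toy single-producer/single-consumer channel between two threads, written in C with pthreads. This is only an illustration of the blocking-FIFO semantics; it is not the Multicore Association's CAPI/MCAPI API, and the names channel_t, put, and get are invented here:

#include <pthread.h>
#include <stdio.h>

#define DEPTH 16

/* Toy bounded channel: one sender, one receiver, blocking put/get. */
typedef struct {
    int buf[DEPTH];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} channel_t;

static void channel_init(channel_t *ch) {
    ch->head = ch->tail = ch->count = 0;
    pthread_mutex_init(&ch->lock, NULL);
    pthread_cond_init(&ch->not_empty, NULL);
    pthread_cond_init(&ch->not_full, NULL);
}

/* Put blocks when the FIFO is full, like hardware backpressure. */
static void put(channel_t *ch, int data) {
    pthread_mutex_lock(&ch->lock);
    while (ch->count == DEPTH) pthread_cond_wait(&ch->not_full, &ch->lock);
    ch->buf[ch->tail] = data;
    ch->tail = (ch->tail + 1) % DEPTH;
    ch->count++;
    pthread_cond_signal(&ch->not_empty);
    pthread_mutex_unlock(&ch->lock);
}

/* Get blocks when the FIFO is empty. */
static int get(channel_t *ch) {
    pthread_mutex_lock(&ch->lock);
    while (ch->count == 0) pthread_cond_wait(&ch->not_empty, &ch->lock);
    int data = ch->buf[ch->head];
    ch->head = (ch->head + 1) % DEPTH;
    ch->count--;
    pthread_cond_signal(&ch->not_full);
    pthread_mutex_unlock(&ch->lock);
    return data;
}

static channel_t ch;                    /* the "connected" channel */

static void *sender(void *arg) {        /* like Put(Port1, Data) */
    (void)arg;
    for (int i = 0; i < 5; i++) put(&ch, i);
    return NULL;
}

static void *receiver(void *arg) {      /* like Get(Port2, Data) */
    (void)arg;
    for (int i = 0; i < 5; i++) printf("got %d\n", get(&ch));
    return NULL;
}

int main(void) {
    pthread_t s, r;
    channel_init(&ch);
    pthread_create(&s, NULL, sender, NULL);
    pthread_create(&r, NULL, receiver, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}

A hardware FIFO or cache-to-cache implementation (slides 24 and 25) replaces the mutex and condition variables with interconnect-level backpressure, but the programmer-visible pattern stays the same.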

24. CAPI’s Stream Implementation 1
[Diagram: Process A (e.g., FIR1) on Core 1 and Process B (e.g., FIR2) on Core 2 of a multicore chip, connected by I/O register-mapped hardware FIFOs, as in SoCs]

25. CAPI’s Stream Implementation 2
[Diagram: Process A (e.g., FIR) on Core 1 and Process B (e.g., FIR) on Core 2, each with its own cache, communicating via on-chip cache-to-cache transfers over the on-chip interconnect, as in general multicores]

26. Conclusions
• Multicore is here to stay
• Evolve the core and the interconnect
• Create multicore programming standards – users are ready
• Multicore success requires
  • A reduction in core cache size
  • Adoption of a mesh-based on-chip interconnect
  • Use of a stream-based programming API
• Successful solutions will offer an evolutionary transition path
