This presentation explores the evolution and challenges of manycore architectures from both hardware and software standpoints. It discusses the implications of Moore's Law and the limits imposed by the instruction-level parallelism (ILP), frequency, power, and memory walls. The focus includes how advanced multicore processors, such as Intel's Xeon and AMD's Opteron, navigate these challenges through innovations in cache architecture and interconnection fabrics. Insights into modern mediaDSP technology and task-based programming models highlight the future of computational efficiency in video processing and beyond.
Manycores – from a hardware perspective to software Presenter: D96943001, Institute of Electronics, 陳泓輝
Why Moore’s Law is dying • He (Gordon Moore) is not CEO anymore!! • Walls => the ILP, frequency, power, and memory walls
ILP – more cost, less return • ILP: instruction-level parallelism • OOO: out-of-order execution of micro-ops
Frequency wall • FO4 delay metric: the delay of an inverter driving four copies of itself (a fan-out of 4) • Freq ↑ => cycle counts of some operations ↑ => performance saturates!
Memory wall • The external memory access penalty keeps growing (the CPU–memory gap) • Solution => enlarge the caches • Cache size largely decides both the performance and the price
The power wall • High power may imply • Thermal runaway of device behavior • Larger current => electromigration => reliability issues in the metal interconnect • Hitting the packaging heat limit • Switching to high-cost packaging • Cooling noise!! • Form factor
The great wall…… Moore’s Law => CMOS => Multicore => Manycore
Historical - Intel 2007 Xeon • Dual on-chip memory controllers => fcpu > 2*fmem • Point-to-point interconnection => fabrics • Multiple concurrent communication activities (cf. a “bus” => one activity at a time)
AMD – Opteron (Shanghai) • Much the same as Intel Xeon • Shared L3 cache among the cores
Game consoles • Xbox 360 => triple core (homogeneous) • PS3 => Cell, 8+1 cores (heterogeneous) • PowerPC wins either way!
State-of-the-art multicore DSP chips • TI TNETV3020 (homogeneous) • Freescale 8156 (heterogeneous)
State-of-the-art multicore DSP chips • picoChip PC205 (heterogeneous) • Tilera TILE64 (homogeneous, mesh)
State-of-the-art multicore x86 chips – Intel Single-chip Cloud Computer • 24 “tiles” with two IA cores (1 GHz Pentium-class) per tile • A 24-router mesh network with 256 GB/s bisection bandwidth • 4 integrated DDR3 memory controllers • Hardware support for message-passing!!
GPGPU - OpenCL (official logo)
Special case: multicore video processor • Characteristics of video applications in consumer electronics • High computational capability • Low hardware cost • Low power consumption • A general solution • Fixed-function logic design • Challenges • Multiple video decoding standards • Evolving video decoding standards • Ill-posed video processing algorithms • Product requirements are diverse and mutually exclusive
mediaDSP technology (nickname: accelerator) • Broadcom: mediaDSP technology • Heterogeneous (programmable and fixed-function units) • A task-based programming model • A uniform approach for managing tasks executing on different types of programmable and fixed-function processing elements • A platform, easily extendable to support a range of applications • Easy to customize for special purposes • Success stories • SD MPEG video encoder including scaling and noise reduction • Frame-rate-conversion video processing for FHD@60Hz/120Hz video
Classes of video processing • Highly parallelizable operations on fixed-point data, no floating point • => a processor with a SIMD data path engine • Ad-hoc computation and decision making, operating on the smaller data sets produced by the parallelizable processes • => a general-purpose processor such as a RISC • Data movement and formatting on multidimensional pixel data • Bit-serial processing for entropy decoding and encoding => dedicated hardware does this job very efficiently
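A toy example of the first class above: a fixed-point, data-parallel pixel operation (here a saturating add, a common video primitive) that maps naturally onto a SIMD datapath, since every lane runs the same arithmetic with no data-dependent branching. This is an illustrative sketch, not Broadcom's code.

```c
#include <stdint.h>

/* Saturating 8-bit add over a pixel array: each iteration is independent,
 * so a SIMD engine can process many lanes per cycle. */
static void add_sat_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
{
    for (int i = 0; i < n; i++) {               /* one SIMD lane per index */
        int s = a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);  /* saturate, don't wrap */
    }
}

/* Tiny self-check: 200+100 saturates to 255, 10+20 stays 30. */
static int demo_sat(void)
{
    uint8_t a[2] = {200, 10}, b[2] = {100, 20}, d[2];
    add_sat_u8(d, a, b, 2);
    return d[0] == 255 && d[1] == 30;
}
```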
Task-based programming model • Programmers’ duties are as follows: • Partition a sequential algorithm into a set of parallelizable tasks and then efficiently map them onto the massively parallel architecture • A task has a definite initiation time • A task runs until completion, with no interruption and no further synchronization with other tasks • Understand the hardware architecture and its limitations • Shared memory (instead of FIFO mode) • Buffer size must be enough for a data unit • Interconnect bandwidth must be sufficient • Computational power must be sufficient for real time
(IP) Platform-based architecture • Task-oriented engine (TOE) • A programmable DSP or a fixed-function unit • Task control unit (TCU) • A RISC core that maintains a queue of tasks and synchronizes with other TCUs/TOEs • Goal: maximize the utilization of the TOEs • Control engine • Shared memory • Communication fabric
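The two slides above can be sketched in a few lines of C: a TCU keeps a FIFO of tasks, and each task runs to completion on a TOE with no preemption and no mid-task synchronization. All names (`tcu_t`, `tcu_submit`, `tcu_dispatch_one`) are hypothetical, invented for illustration only.

```c
#include <stddef.h>

typedef struct {
    void (*run)(void *args);   /* task body: runs until completion */
    void *args;
} task_t;

#define QUEUE_DEPTH 16

typedef struct {
    task_t queue[QUEUE_DEPTH]; /* simple ring buffer of pending tasks */
    size_t head, tail;
} tcu_t;

static int tcu_submit(tcu_t *tcu, task_t t)
{
    size_t next = (tcu->tail + 1) % QUEUE_DEPTH;
    if (next == tcu->head)
        return -1;             /* queue full */
    tcu->queue[tcu->tail] = t;
    tcu->tail = next;
    return 0;
}

/* Pop one task and run it to completion (no interruption, as on a TOE). */
static int tcu_dispatch_one(tcu_t *tcu)
{
    if (tcu->head == tcu->tail)
        return 0;              /* nothing pending */
    task_t t = tcu->queue[tcu->head];
    tcu->head = (tcu->head + 1) % QUEUE_DEPTH;
    t.run(t.args);             /* definite start, runs until done */
    return 1;
}

/* Usage sketch: submit the same increment task twice, drain the queue. */
static void add_one(void *p) { (*(int *)p)++; }

static int demo(void)
{
    tcu_t tcu = { .head = 0, .tail = 0 };
    int x = 0;
    task_t t = { add_one, &x };
    tcu_submit(&tcu, t);
    tcu_submit(&tcu, t);
    while (tcu_dispatch_one(&tcu)) { }
    return x;
}
```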
Memory architecture • All TOEs use software-managed DMA rather than caches for their local storage • 6D addressing (x,y,t,Y,U,V) and the chunking of blocks into smaller subblocks. • No {pre-fetching, early load scheduling, cache, speculative execution, multithreading …} • Memory hierarchy • L1 - Processor Instruction and Data Memory • L2 - On-chip Shared Memory • L3 - Off-chip
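The 6-D (x, y, t, Y, U, V) addressing mentioned above can be modeled as a DMA descriptor carrying a stride per dimension, from which the engine turns a coordinate into a linear byte offset. The field names below are assumptions for illustration, not Broadcom's actual descriptor layout.

```c
#include <stddef.h>

/* Hypothetical descriptor for software-managed DMA over video data. */
typedef struct {
    size_t pixel_stride;    /* bytes between horizontally adjacent pixels */
    size_t line_stride;     /* bytes between vertically adjacent lines    */
    size_t frame_stride;    /* bytes between consecutive frames (t axis)  */
    size_t plane_offset[3]; /* start of the Y, U and V planes             */
} dma_desc_t;

/* Map a 6-D coordinate (x, y, t, plane in {Y=0, U=1, V=2}) to a byte
 * offset; subblock chunking would add further stride levels on top. */
static size_t dma_offset(const dma_desc_t *d,
                         size_t x, size_t y, size_t t, unsigned plane)
{
    return d->plane_offset[plane]
         + t * d->frame_stride
         + y * d->line_stride
         + x * d->pixel_stride;
}
```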
Broadcom BCM35421 chip [1/2] • Does motion-compensated frame-rate conversion • Doubles the frame rate from FHD@60fps to FHD@120fps (to reduce motion blur) • 24fps => 60fps (de-judder)
Broadcom BCM35421 chip [2/2] • 65nm CMOS process • mediaDSP runs at 400 MHz • 106 Million transistors • Two Teraops of peak integer performance
Performance of DSPs for applications • A DSP becomes useful when it can perform at least 100 instructions per sample period • 68% of DSPs shipped in 2008 went into mobile handsets and base stations • Such applications may spend several K cycles processing an input sample
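The "100 instructions per sample period" rule above translates directly into a required instruction rate: sample rate times instructions per sample. A back-of-envelope helper (my own illustration, not from the slides):

```c
/* Required instruction rate in MIPS for a given sample rate and
 * per-sample instruction budget. */
static double required_mips(double sample_rate_hz, double insns_per_sample)
{
    return sample_rate_hz * insns_per_sample / 1e6;
}

/* Example: 48 kHz audio at 100 instructions/sample needs ~4.8 MIPS,
 * trivial today; a 10 Msps baseband stream at several K cycles per
 * sample needs tens of GIPS, which is why base stations went multicore. */
```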
Multiple elements • For increasing performance: multiple elements beat a single higher-performance element
Go deeper – TI’s multicore (Multicore Programming Guide)
Mapping an application to multicore • Know the processing-model options • Identify all the tasks • Partition tasks into many small ones • Be familiar with the inter-task communication/data flow • Combination/aggregation • Mapping • Memory hierarchy => L1/L2/L3, private/shared memory, external memory channel count/capability • DMA • Special-purpose hardware!! • FFT, Viterbi, Reed-Solomon, AES codec, entropy codec
Parallel processing models • Master/slave model • Data flow • Very successful in communication systems • Router • Base station
Data movement • Shared memory • Dedicated memory • Transitional memory => ownership changes, content is not copied
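The "ownership changes, content is not copied" idea above can be sketched as a buffer whose owner field is flipped between cores instead of copying the payload. In real hardware the owner field would live in shared memory and the update would need an atomic or a barrier; this single-threaded model and its names are illustrative only.

```c
/* A shared buffer whose ownership, not contents, moves between cores. */
typedef struct {
    int owner;                   /* core id allowed to touch the payload */
    unsigned char payload[4096];
} shared_buf_t;

/* Hand the buffer from one core to another: O(1), no memcpy of 4 KB. */
static int handoff(shared_buf_t *b, int from_core, int to_core)
{
    if (b->owner != from_core)
        return -1;               /* caller does not own the buffer */
    b->owner = to_core;          /* ownership change, content untouched */
    return 0;
}
```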
Notification [1/4] • Direct signaling • Create an event in another core’s local interrupt controller • The other core polls its local status • Or the local interrupt controller converts the event into a real interrupt
Notification [2/4] • Indirect signaling • Not directly controlled by software
Notification [3/4] • Atomic arbitration • Hardware semaphore/mutex • Semaphore => allows limited multiple access => example: multi-port SRAM/external DDR memory • Mutex => allows only one access • Use a software semaphore instead if the resource is shared only between processes executing on a single core • The overhead of a hardware semaphore is not small • It is only a facility for software use; the hardware only guarantees the atomic operation, the locked content itself is not protected • Cost/performance considerations
Notification [4/4] • The left diagram is a mutex • Just like its software counterpart
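A minimal model of the hardware-semaphore arbitration above. On a real device the lock would be a memory-mapped register whose read atomically returns and claims it; here a plain flag stands in, and, as the slide notes, the hardware only makes the claim atomic: protecting the guarded data is still software's job. Names are illustrative.

```c
/* 0 = free, 1 = taken. On real hardware this is an MMIO register. */
static int hw_sem = 0;

/* Try to claim the semaphore; returns 1 on success, 0 if already held.
 * The test-and-set below is what the hardware would do atomically. */
static int sem_try_lock(void)
{
    if (hw_sem == 0) {
        hw_sem = 1;
        return 1;
    }
    return 0;
}

static void sem_unlock(void)
{
    hw_sem = 0;   /* release; does nothing to the data it guarded */
}
```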
Data transfer engines • DMA => system DMA, plus local DMA belonging to a core • Ethernet • Up to 32 MAC addresses • RapidIO • Implemented on an ultra-fast serial I/O physical layer • May use multiple serial I/O links, uni- or bi-directional • Examples • USB 2.0 => 480 Mbit/s; USB 3.0 => 5 Gbit/s • Serial ATA • 1.0 (Gen 1) => 1.5 Gbit/s; 2.0 (Gen 2) => 3 Gbit/s • 3.0 (Gen 3) => 6 Gbit/s
High-speed serial links: USB, SATA
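One caveat on the line rates listed above: SATA (and USB 3.0's 5 Gbit/s mode) use 8b/10b encoding, so 10 line bits carry only 8 payload bits. A small helper makes the usable bandwidth explicit (my own illustration):

```c
/* Payload bandwidth in MB/s for an 8b/10b-encoded serial link:
 * every 10 line bits carry one payload byte. */
static double payload_mbytes_per_sec(double line_gbit_per_sec)
{
    return line_gbit_per_sec * 1e9 / 10.0 / 1e6;
}

/* SATA Gen1: 1.5 Gbit/s -> 150 MB/s; Gen2: 3 -> 300; Gen3: 6 -> 600. */
```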
Memory management • Devices do not support automatic cache coherency among cores, because of the power consumption involved and the latency overhead introduced • Switched central resource (SCR) fabric
Highlights [1/3] • A portion of the cache can be configured as memory-mapped SRAM • Transparent cache => visible • Address aliasing => masking the MS byte • For core 0: 0x10800000 == 0x00800000 • For core 1: 0x11800000 == 0x00800000 • For core 2: 0x12800000 == 0x00800000 • Special register DNUM (the core id) for dynamic pointer address update • Implicit: common ROM code can still access each core’s private area • Explicit: each core has its own DNUM
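The aliasing scheme above can be written down directly: inserting the core id (DNUM) into the second-highest nibble turns a core-local address into its global alias, and masking the MS byte goes the other way. This is a sketch matching the three example addresses on the slide, not vendor code.

```c
#include <stdint.h>

/* Core-local L2 address -> globally visible alias for core `dnum`,
 * e.g. 0x00800000 on core 2 -> 0x12800000. */
static uint32_t local_to_global(uint32_t local_addr, uint32_t dnum)
{
    return 0x10000000u | (dnum << 24) | local_addr;
}

/* Global alias -> core-local address by masking the MS byte. */
static uint32_t global_to_local(uint32_t global_addr)
{
    return global_addr & 0x00FFFFFFu;
}
```

Common code linked once can use the running core's DNUM with `local_to_global` to reach its own private area, which is the "implicit" case on the slide.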
Highlight [2/3] • The only coherency guaranteed by hardware: • L1D <=> L2 (core-local) • L1D <=> L2 <=> SL2 (if configured as memory-mapped SRAM; core-local) • Equal access to the external DDR2 SDRAM through the SCR – this may be the bottleneck for certain applications
Highlight [3/3] • If any portion of the L1s is configured as memory-mapped SRAM, there is a small paging engine built into the core (IDMA) that can be used to transfer linear blocks of memory between L1 and L2 in the background of CPU operation • Paging engine => MMU • IDMA may also be used to perform bulk peripheral configuration register access
DSP code and data image • Image types • Single image • Multiple images • Multiple images with shared code and data • A complex linking scheme must be used • Device boot
Debugging • CUDA => XXOO↑↑↓↓←→←→BA
TI’s offering • Hardware emulation => ICE, JTAG • Basically non-intrusive • Software instrumentation • Patching the original code to enable the same capability => this time, “Trace Logs” • Basically intrusive • Types of Trace Logs • API call log, statistics log, DMA transaction log, event log, customer data log
More on logs • Information stored in memory is pulled back to the host through the hardware-emulation path • Tools are provided to correlate all the logs • And to display them in an organized manner • Log example:
Go deeper – Freescale’s manycore (Embedded Multicore: An Introduction)
Why manycore? • Freescale MPC8641 • Single core => freq x 1.5 => power x 2 • Dual core => freq x 1.5 => power x 1.3 • (Bug in this figure)
SMP + AMP + sharing • Manycore enables multiple OSes running concurrently • Memory sharing => MMU • Interface/peripheral sharing => hypervisor • Virtualization is good for legacy support
Manycore example [1/2]