1 / 34

ARM Cortex-A9 MPCore ™ processor

ARM Cortex-A9 MPCore ™ processor. Presented by- Chris Cai (xiaocai2) Rehana Tabassum (tabassu2) Sam Mussmann (mussmnn2). Background.

saima
Download Presentation

ARM Cortex-A9 MPCore ™ processor

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ARM Cortex-A9 MPCore™ processor • Presented by- • Chris Cai (xiaocai2) • Rehana Tabassum (tabassu2) • Sam Mussmann(mussmnn2)

  2. Background “The architectural simplicity of ARM processors leads to very small implementations, and small implementations mean devices can have very low power consumption. Implementation size, performance, and very low power consumptionare key attributes of the ARM architecture.” ARM Architecture Reference Manual ARMv7-A edition

  3. Background (2) • ARM is RISC • Uniform register file • Load/store architecture • Simple addressing

  4. Background (3) • The ARM Cortex-A9 processor is the high performance choice in a family of low power, cost-sensitive devices. • The Cortex-A9 microarchitecture is delivered either as a Cortex-A9 single core processor or a scalable multicore processor: the Cortex-A9 MPCore ™ processor

  5. Where is it used? • Examples: • Apple A5 (iPhone 4S, iPad 2, iPad mini) http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore#Implementations http://en.wikipedia.org/wiki/Iphone_4s

  6. Where is it used? (2) • Examples: • NVIDIA Tegra 2 (Motorola Xoom, Droid X2) http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore#Implementations http://en.wikipedia.org/wiki/Motorola_Xoom

  7. Where is it used? (3) • Examples: • PlayStation Vita http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore#Implementations http://en.wikipedia.org/wiki/PlayStation_Vita

  8. What are its specs? • The Cortex A9 core: • Gives 2.50 DMIPS/MHz/core (Dhrystone MIPS) • Generally clocked between 800MHz and 2GHz • Possible to run > 1GHz and < 250mW http://arm.com/products/processors/cortex-a/cortex-a9.php?tab=Specifications http://www.linuxfordevices.com/c/a/News/ARM-spins-multicoreenabled-Cortex-core/

  9. Presentation Overview • Micro-architecture • Memory System • Multi-core

  10. Microarchitecture Overview • Variable length, out of order, superscalar pipeline • Two instructions are fetched in one cycle • Issue up to 4 instructions per cycle into: • Primary data processing pipeline • Secondary data processing pipeline • Load-store pipeline • Compute engine (FPU/NEON) pipeline • Speculative execution • Supporting virtual renaming of physical registers and removing pipelines stalls due to data dependencies

  11. CortexA9 Microarchitecture Execute Rename Issue Writeback Decode Instruction Fetch Memory www.arm.com/files/pdf/armcortexa-9processors.pdf

  12. Instruction Fetch • Instruction cache size: 16KB, 32KB, or 64KB • Superscalar pipeline: fetching two instructions at once • Branch Prediction: • Global History Buffer: 1K ~ 16K entries • Branch-Target Address Cache: 512 ~ 4K entries • Return stack of 4 x 32 bits • Fast-loop mode: instruction loop that are smaller than 64 bytes often complete without additional instruction cache accesses

  13. Instruction Decode • Super Scalar Decoder • Capable of decoding two full instructions per cycle

  14. Rename • Register Renaming • Resolving data dependencies and unroll small loops by hardware

  15. Issue • Issue can be fed maximum of 2 instructions per cycle • Issue can dispatch up to 4 instructions per cycle • Out of order selection of instructions from queue

  16. Execute • Variable length Executing Stage (1 ~ 3 cycles) • Most Instructions finish within 1 cycle • Instruction which folds shifts and rotates can take 3 cycles • ADD r0, r1, r2 (1 cycle) • ADD r0, r1, r2 LSL #2 (2 cycle) • Corresponds to a = b + (c << 2); • ADD r0, r1, r2 LSL r3 (3 cycle) • Corresponds to a = b + (c << d);

  17. Execute (2) • NEON Media Processing Engine • NEON technology supports instructions targeted primarily at audio, video, 3D graphics, image and speech processing. http://www.arm.com/files/pdf/AT_-_NEON_for_Multimedia_Applications.pdf

  18. Execute (3) • What is NEON? • NEON is a wide SIMD data processing architecture • 32 registers, 64 bit wide or 16 registers, 128 bit wide • NEON instructions perform “Packed SIMD” processing • Registers can be considered as “vector” of same data type • Instructions perform the same operation in all lanes http://www.arm.com/files/pdf/AT_-_NEON_for_Multimedia_Applications.pdf

  19. Execute (4) • NEON Media Processing Engine supports vector computations on: • half-precision (16bit), single-precision (32bit), double-precision (64bit) floating-point numbers • 8, 16, 32 and 64 bit signed and unsigned integers • Supported Operations Include: • addition, subtraction, multiplication • maximum or minimum of a vector of operands • Inverse square-root approximation (y = x^-(1/2)) • many more

  20. Memory • dependent load-store instructions forwarded for resolution within memory system • 2-level TLB structure • micro TLB • 32 entries on data side and 32 or 64 entries on instruction side • to reduce power consumed in translation and protection look-ups • main TLB http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf

  21. Memory (2) • Data prefetcher • monitor cache line requests by processor and cache misses to determine how much data to prefetch • can prefetchup to 8 independent data streams • prefetch and allocate data in the L1 data cache, as long as it keeps hitting in the prefetched cache line • When stop prefetching?

  22. Memory Hierarchy Cortex A9 MPcore CPU CPU CPU CPU Instruction Cache Data Cache Data Cache Instruction Cache Instruction Cache Data Cache Instruction Cache Data Cache Snoop Control Unit (SCU) Accelerator Coherence Port L2 Cache Main Memory http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

  23. L1 caches Cortex A9 MPcore CPU CPU CPU CPU • Non-unified • 32 bytes line length • can be disabled independently • 16, 32 or 64KB • 4 - way associative • support for Security Extensions • I cache: VIPT • D cache: PIPT • reduce number of caches flushes and refills and save energy D$ I$ I$ D$ D$ I$ D$ I$ SCU ACP AXI RW 64-bit bus AXI RW 64-bit bus L2 Cache Main Memory http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

  24. L2 cache Cortex A9 MPcore CPU CPU CPU CPU • shared, unified • Off-chip • 128KB to 8MB • 4 to 16-way associative D$ I$ I$ D$ D$ I$ D$ I$ SCU ACP AXI RW 64-bit bus AXI RW 64-bit bus L2 Cache Main Memory http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

  25. Snoop Control Unit Cortex A9 MPcore • Integral part of cache memory systems • Connects processors to memory system through AXI interfaces CPU CPU CPU CPU D$ I$ I$ D$ D$ I$ D$ I$ SCU ACP AXI RW 64-bit bus AXI RW 64-bit bus L2 Cache Main Memory http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

  26. Snoop Control Unit (1) • SCU functions : • maintain data cache coherency • initiate L2 memory accesses • arbitrate between processors’ simultaneous request for L2 accesses • manages accesses from ACP • does not support instruction cache coherency http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

  27. Accelerator Coherence Port • optional AXI 64-bit slave port • allows to connect to non-cached system mastering peripherals and accelerators • —DMA engine or cryptographic accelerator • SCU enforces memory coherency http://www.arm.com/files/pdf/ARMCortexA-9Processors.pdf

  28. Multi-Core http://www.arm.com/files/pdf/ARMCortexA-9Processors.pdf

  29. Cache Coherence – MESI http://en.wikipedia.org/wiki/MESI_protocol

  30. Cache Coherence – MESI (2) • ARM MPCore has optimizations to MESI: • Duplicated tag RAMs • All done in the Snoop Control Unit System Level Benchmarking Analysis of the Cortex A9 MPCore  Roberto Mijat, ARM Connected Community Technical Symposium, 2009

  31. Cache Coherence – MESI (2) • ARM MPCore has optimizations to MESI: • Duplicated tag RAMs • Cache-2-Cache transfer • All done in the Snoop Control Unit System Level Benchmarking Analysis of the Cortex A9 MPCore  Roberto Mijat, ARM Connected Community Technical Symposium, 2009

  32. Cache Coherence – MESI (2) • ARM MPCore has optimizations to MESI: • Duplicated tag RAMs • Cache-2-Cache transfer • Migratory Lines • All done in the Snoop Control Unit System Level Benchmarking Analysis of the Cortex A9 MPCore  Roberto Mijat, ARM Connected Community Technical Symposium, 2009

  33. Generalized Interrupt Control • Which core services interrupts? • GIC gives the programmer control • Centralizes interrupts, then dispatches to individual core(s) System Level Benchmarking Analysis of the Cortex A9 MPCore  Roberto Mijat, ARM Connected Community Technical Symposium, 2009

More Related