Andes Embedded Processor AndesCore™ N1213-S


Presentation Transcript


  1. Andes Embedded Processor AndesCore™ N1213-S www.andestech.com

  2. Agenda • Computer architecture • AndesCore™ • Pipeline • Cache • MMU • DMA • BIU • Interruption • AICE

  3. Computer architecture taxonomy • von Neumann architecture

  4. Computer architecture taxonomy (1/3) • von Neumann architecture • Features: • Execution in multiple cycles • Serial fetch of instructions & data • Single memory structure • Data and program can be mixed • Data and instructions are the same size • Examples, von Neumann: PCs (Intel 80x86/Pentium), Motorola 68000 and 68xx uC families

  5. Computer architecture taxonomy (2/3) • Harvard architecture • [block diagram: CPU with program counter; separate program memory and data memory, each with its own address and data buses]

  6. Computer architecture taxonomy (3/3) • Harvard architecture • Features: • Execution in 1 cycle • Parallel fetch of instructions & data • More complex H/W • Instructions and data always separate • Different code/data path widths • Examples, Harvard: 8051, Microchip PIC families, Atmel AVR, AndesCore

  7. Architectures: CISC vs. RISC (1/2) • CISC - Complex Instruction Set Computers • Emphasis on hardware • Includes multi-clock complex instructions • Memory-to-memory • Sophisticated arithmetic (multiply, divide, trigonometry etc.). • Special instructions are added to optimize performance with particular compilers.

  8. Architectures: CISC vs. RISC (2/2) • RISC - Reduced Instruction Set Computers • A very small set of primitive instructions • Fixed instruction format • Emphasis on software • All instructions execute in one cycle (fast!) • Register-to-register operations (memory is accessed only through load/store instructions) • Pipeline architecture

  9. Single-, Dual-, Multi-, Many- Cores • Single-core: • Most popular today • Dual-core, multi-core, many-core: • Forms of multiprocessors in a single chip • Small-scale multiprocessors (2-4 cores): • Utilize task-level parallelism • Task examples: audio decode, video decode, display control, network packet handling • Large-scale multiprocessors (>32 cores): • nVidia's graphics chips: >128 cores • Sun's server chips: 64 threads

  10. AndesCore™

  11. AndesCore™ N1213-S • CPU Core • 32-bit CPU • Single issue with 8-stage pipeline • AndeStar™ ISA with 16-/32-bit intermixable instructions to reduce code size • Dynamic branch prediction to reduce branch penalties • 32/64/128/256-entry BTB • Configurability for customers • Configuration options for power, performance and area requirements

  12. AndesCore™ N1213-S • MMU • Fully-associative iTLB/dTLB: 4 or 8 entries • 4-way set-associative main TLB: 32/64/128 entries • Two page-size groups supported: (4KB, 1MB) and (8KB, 1MB) • TLB locking support • I & D cache • Virtual index and physical tag (for faster context switching) • Cache size: 8KB/16KB/32KB/64KB • Cache line size: 16B/32B • 2/4-way set associative • I-Cache locking support

  13. AndesCore™ N1213-S • I & D Local Memory • Wide size range for internal/external local memory: 4KB~1024KB • Fixed access latencies for internal local memory • Double buffer mode for D local memory • Optional external local memory interface • Bus • Synchronous/Asynchronous AHB • 1 or 2 port configuration • Synchronous HSMP • AXI-like • 1 or 2 port configuration

  14. AndesCore™ N1213-S • For performance • Improved memory accesses: • 1D/2D DMA, load/store multiple • Efficient synchronization without locking the whole bus • Load-lock and store-conditional instructions (see the sketch below) • Vectored interrupts to improve real-time performance • 6 interrupt signals • MMU • Optional HW page table walker • TLB management instructions • For flexibility • Memory-mapped IO space • PC-relative jumps for position-independent code • JTAG-based debug support • Optional embedded program trace interface • Performance monitors for performance tuning • Bi-endian modes to support flexible data input
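
A minimal sketch of how a load-lock/store-conditional pair is typically used for lock-free updates. It uses GCC's __atomic builtins, which on LL/SC-style cores generally compile down to exactly this kind of retry loop; it illustrates the synchronization idea only and is not Andes toolchain code.

#include <stdint.h>
#include <stdio.h>

/* Illustration only: __atomic builtins standing in for a load-lock /
 * store-conditional retry loop.  The word is loaded, a new value is
 * computed, and the conditional store succeeds only if no other master
 * modified the word in between; otherwise the loop retries. */
static int32_t atomic_add(volatile int32_t *addr, int32_t delta)
{
    int32_t old = __atomic_load_n(addr, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n(addr, &old, old + delta,
                                        1 /* weak */, __ATOMIC_ACQ_REL,
                                        __ATOMIC_RELAXED))
        ;   /* 'old' is refreshed with the current value on failure */
    return old + delta;
}

int main(void)
{
    volatile int32_t counter = 0;
    printf("counter = %d\n", atomic_add(&counter, 5));   /* prints 5 */
    return 0;
}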

  15. Pipeline

  16. AndesCore 8-stage pipeline

  17. Instruction Fetch Stage • F1 – Instruction Fetch First • Instruction Tag/Data Arrays • ITLB Address Translation • Branch Target Buffer Prediction • F2 – Instruction Fetch Second • Instruction Cache Hit Detection • Cache Way Selection • Instruction Alignment • [pipeline diagram: IF1, IF2, ID, RF, AG, DA1, DA2, WB, with EX and MAC1/MAC2 in the execute stages]

  18. Instruction Issue Stage • I1 – Instruction Issue First / Instruction Decode • 32/16-Bit Instruction Decode • Return Address Stack Prediction • I2 – Instruction Issue Second / Register File Access • Instruction Issue Logic • Register File Access

  19. Execution Stage • E1 – Instruction Execute First / Address Generation / MAC First • Data Access Address Generation • Multiply Operation (if the MAC is present) • E2 – Instruction Execute Second / Data Access First / MAC Second / ALU Execute • ALU • Branch/Jump/Return Resolution • Data Tag/Data Arrays • DTLB Address Translation • Accumulation Operation (if the MAC is present) • E3 – Instruction Execute Third / Data Access Second • Data Cache Hit Detection • Cache Way Selection • Data Alignment

  20. Write Back Stage • E4 – Instruction Execute Fourth / Write Back • Interruption Resolution • Instruction Retire • Register File Write Back

  21. Branch Prediction Overview • Why is branch prediction required? • A deep pipeline is required for high speed • Why dynamic branch prediction? • Static branch prediction • Dynamic branch prediction

  22. Branch Prediction Unit • Branch Target Buffer (BTB) • 128 entries of 2-bit saturating counters • 128 entries, 32-bit predicted PC and 26-bit address tag • Return Address Stack (RAS) • Four entries • BTB and RAS updated by committing branches/jumps
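
A hedged software model of the 2-bit saturating counters mentioned above: each counter moves one step toward "taken" or "not taken" per outcome and saturates at the ends. The indexing, tags and predicted target PC of a real BTB entry are omitted.

#include <stdint.h>
#include <stdio.h>

/* 2-bit saturating counter: 0,1 predict not-taken; 2,3 predict taken. */
typedef uint8_t ctr2_t;                  /* holds values 0..3 */

static int predict_taken(ctr2_t c) { return c >= 2; }

static ctr2_t update(ctr2_t c, int taken)
{
    if (taken)  return (c < 3) ? (ctr2_t)(c + 1) : 3;   /* saturate at 3 */
    else        return (c > 0) ? (ctr2_t)(c - 1) : 0;   /* saturate at 0 */
}

int main(void)
{
    ctr2_t c = 1;                        /* start weakly not-taken */
    int outcomes[] = { 1, 1, 0, 1 };     /* example branch history */
    for (int i = 0; i < 4; i++) {
        printf("predict=%d actual=%d\n", predict_taken(c), outcomes[i]);
        c = update(c, outcomes[i]);
    }
    return 0;
}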

  23. BTB Instruction Prediction • Because BTB predictions are made from the previous PC rather than from actual instruction decode information, the BTB may make two kinds of mistakes • Wrongly predicting a non-branch/jump instruction as a branch/jump instruction • Wrongly predicting the instruction boundary (32-bit -> 16-bit) • When either case is detected, the IFU triggers a BTB instruction misprediction in the I1 stage and restarts the program sequence from the recovered PC, introducing a 2-cycle penalty

  24. RAS Prediction • When a return instruction appears in the instruction sequence, a RAS prediction is performed and the fetch sequence is redirected to the predicted PC. • Since the RAS prediction is performed in the I1 stage, return instructions incur a 2-cycle penalty because the sequential fetches made in between are discarded.
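
A minimal sketch of the four-entry return address stack described above: calls push their return address and returns pop the predicted target. The wrap-around overflow behaviour is an assumption for illustration; the slides only give the depth.

#include <stdint.h>
#include <stdio.h>

#define RAS_DEPTH 4                      /* four entries, as on the slide */

typedef struct { uint32_t entry[RAS_DEPTH]; int top; } ras_t;

static void ras_push(ras_t *r, uint32_t return_pc)   /* on a call */
{
    r->entry[r->top] = return_pc;
    r->top = (r->top + 1) % RAS_DEPTH;   /* wrap on overflow (assumed) */
}

static uint32_t ras_pop(ras_t *r)        /* on a return: predicted target */
{
    r->top = (r->top + RAS_DEPTH - 1) % RAS_DEPTH;
    return r->entry[r->top];
}

int main(void)
{
    ras_t r = { {0}, 0 };
    ras_push(&r, 0x1000);                                       /* outer call  */
    ras_push(&r, 0x2000);                                       /* nested call */
    printf("predicted return: 0x%x\n", (unsigned)ras_pop(&r));  /* 0x2000 */
    printf("predicted return: 0x%x\n", (unsigned)ras_pop(&r));  /* 0x1000 */
    return 0;
}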

  25. Branch Misprediction • In the N12 processor core, branch/return instructions are resolved by the ALU in the E2 stage and the result is used by the IFU in the next (F1) stage, so the misprediction penalty is 5 cycles.

  26. Cache

  27. N1213-S Block diagram

  28. Cache and CPU • [block diagram: the CPU exchanges addresses and data with the cache controller and cache, which in turn exchange addresses and data with main memory]

  29. Multiple levels of cache • [diagram: CPU → L1 cache → L2 cache]

  30. Cache data flow • [diagram: instruction fetches hit the I-Cache, which is refilled from external memory; loads & stores hit the D-Cache, which is refilled from and written back to external memory; uncached instruction/data accesses and uncached/write-through writes pass directly between the CPU and external memory]

  31. Cache operation • Many main memory locations are mapped onto one cache entry. • May have caches for: • instructions; • data; • data + instructions (unified).
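
A small illustration of the mapping mentioned above, assuming (purely for the example) a 32KB, 4-way set-associative cache with 32-byte lines, one of the configurations listed earlier: the address is split into offset, set index and tag, so many addresses share one set.

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE  32u                               /* bytes per line (assumed) */
#define NUM_WAYS   4u                                /* associativity (assumed)  */
#define CACHE_SIZE (32u * 1024u)                     /* total size (assumed)     */
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))   /* = 256 sets */

static void split_address(uint32_t addr)
{
    unsigned offset = addr % LINE_SIZE;              /* byte within the line */
    unsigned index  = (addr / LINE_SIZE) % NUM_SETS; /* which set            */
    unsigned tag    = addr / (LINE_SIZE * NUM_SETS); /* identifies the line  */
    printf("addr 0x%08x -> tag 0x%05x, set %3u, offset %2u\n",
           (unsigned)addr, tag, index, offset);
}

int main(void)
{
    split_address(0x00001234);
    split_address(0x00001234 + LINE_SIZE * NUM_SETS);  /* same set, different tag */
    return 0;
}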

  32. Replacement policy • Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location. • Two popular strategies: • Random. • Least-recently used (LRU).
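
A hedged sketch of the two replacement policies named on this slide, for a single 4-way set. The timestamps are a software stand-in for the few LRU status bits real hardware keeps per way.

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define WAYS 4

typedef struct { uint32_t tag[WAYS]; uint32_t last_use[WAYS]; } set_t;

/* Random policy: any way may be evicted. */
static int victim_random(void) { return rand() % WAYS; }

/* LRU policy: evict the way whose last use is oldest. */
static int victim_lru(const set_t *s)
{
    int v = 0;
    for (int w = 1; w < WAYS; w++)
        if (s->last_use[w] < s->last_use[v])
            v = w;
    return v;
}

int main(void)
{
    set_t s = { {0}, {30, 10, 40, 20} };             /* way 1 used least recently */
    printf("LRU victim: way %d\n", victim_lru(&s));  /* way 1 */
    printf("random victim: way %d\n", victim_random());
    return 0;
}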

  33. Write operations • Write-through: immediately copy write to main memory. • Write-back: write to main memory only when location is removed from cache.
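
A small software model contrasting the two write policies, assuming a single cached line and an array standing in for main memory; the dirty bit is what write-back adds.

#include <stdint.h>
#include <string.h>

#define LINE 32

static uint8_t memory[1 << 16];                 /* stand-in for main memory */

typedef struct { uint32_t base; uint8_t data[LINE]; int dirty; } line_t;

static void write_through(line_t *l, uint32_t off, uint8_t v)
{
    l->data[off] = v;
    memory[l->base + off] = v;     /* main memory updated immediately */
}

static void write_back(line_t *l, uint32_t off, uint8_t v)
{
    l->data[off] = v;
    l->dirty = 1;                  /* main memory updated only on eviction */
}

static void evict(line_t *l)
{
    if (l->dirty)                  /* flush the whole dirty line */
        memcpy(&memory[l->base], l->data, LINE);
    l->dirty = 0;
}

int main(void)
{
    line_t l = { .base = 0x100, .data = {0}, .dirty = 0 };
    write_through(&l, 0, 0xAA);    /* memory[0x100] is 0xAA right away */
    write_back(&l, 1, 0xBB);       /* memory[0x101] is still stale     */
    evict(&l);                     /* now memory[0x101] is 0xBB        */
    return 0;
}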

  34. Improving Cache Performance • Goal: reduce the Average Memory Access Time (AMAT) • AMAT = Hit Time + Miss Rate * Miss Penalty • Approaches • Reduce Hit Time • Reduce Miss Penalty • Reduce Miss Rate • Notes • There may be conflicting goals • Keep track of clock cycle time, area, and power consumption
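
A worked instance of the AMAT formula above, with illustrative numbers (not N1213-S figures).

#include <stdio.h>

int main(void)
{
    double hit_time     = 1.0;    /* cycles for a cache hit (assumed)       */
    double miss_rate    = 0.05;   /* 5% of accesses miss (assumed)          */
    double miss_penalty = 40.0;   /* cycles to refill from memory (assumed) */

    /* AMAT = Hit Time + Miss Rate * Miss Penalty */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);      /* 1.0 + 0.05 * 40 = 3.0 */
    return 0;
}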

  35. Tuning Cache Parameters • Size: • Must be large enough to fit the working set (temporal locality) • If too big, hit time degrades • Associativity • Needs to be large enough to avoid conflicts, but 4-8 ways perform about as well as fully associative • If too big, hit time degrades • Block size • Needs to be large enough to exploit spatial locality & reduce tag overhead • If too large, there are few blocks ⇒ higher miss rate & miss penalty • A configurable architecture allows designers to make the best performance/cost trade-offs

  36. Memory Management Units (MMU)

  37. N1213-S Block diagram

  38. MMU Functionality • The memory management unit (MMU) translates addresses • [diagram: CPU issues a logical address → MMU → physical address to memory]

  39. MMU Architecture • [block diagram: the IFU has a 4/8-entry I-uTLB and the LSU a 4/8-entry D-uTLB; misses go through an M-TLB arbiter to the 32x4 main TLB, organized as N(=32) sets × k(=4) ways = 128 entries of M-TLB tag and data arrays, indexed by set number and way number; a hardware page table walker (HPTWK) sits between the M-TLB and the bus interface unit]

  40. MMU Functionality • Virtual memory addressing • Better memory allocation, less fragmentation • Allows shared memory • Dynamic loading • Memory protection (read/write/execute) • Different permission flags for kernel/user mode • The OS typically runs in kernel mode • Applications run in user mode • Cache control (cached/uncached) • Accesses to peripherals and other processors need to be uncached.
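
A hedged software model of the translation and protection checks described above, assuming 4KB pages and a flat page table purely for illustration; the real MMU performs this lookup in the uTLBs and the main TLB.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                       /* 4KB pages (one supported size) */
#define NUM_PAGES  16                       /* tiny table, illustration only  */

typedef struct {
    uint32_t frame;                         /* physical frame number */
    unsigned writable : 1, user : 1, valid : 1;
} pte_t;

static pte_t page_table[NUM_PAGES];

/* Returns the physical address, or -1 on a translation/protection fault. */
static int64_t translate(uint32_t vaddr, int is_write, int user_mode)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    if (vpn >= NUM_PAGES || !page_table[vpn].valid) return -1;  /* no mapping   */
    if (is_write && !page_table[vpn].writable)      return -1;  /* write denied */
    if (user_mode && !page_table[vpn].user)         return -1;  /* kernel-only  */
    return ((int64_t)page_table[vpn].frame << PAGE_SHIFT)
         | (vaddr & ((1u << PAGE_SHIFT) - 1));
}

int main(void)
{
    page_table[2] = (pte_t){ .frame = 0x80, .writable = 0, .user = 0, .valid = 1 };
    printf("user read of 0x2010   -> %lld (fault: kernel-only page)\n",
           (long long)translate(0x2010, 0, 1));
    printf("kernel read of 0x2010 -> 0x%llx\n",
           (long long)translate(0x2010, 0, 0));   /* 0x80010 */
    return 0;
}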

  41. Direct Memory Access (DMA)

  42. N1213-S Block diagram

  43. DMA overview • Two channels • One active channel • Programmed using physical addresses • For both instruction and data local memory • External address can be incremented with a stride • Optional 2-D Element Transfer (2DET) feature, which provides an easy way to transfer two-dimensional blocks from external memory • [diagram: Local Memory ↔ DMA Controller ↔ Ext. Memory]
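
A software model of the 2DET addressing pattern just described: copy a W×H block out of a larger external frame into packed local memory, advancing the external pointer by a row stride. The real LMDMA is programmed through its own registers; this loop only illustrates the access pattern, and the buffer sizes are arbitrary.

#include <stdint.h>
#include <string.h>

static void dma_2d_copy(uint8_t *local, const uint8_t *ext,
                        size_t width, size_t height, size_t ext_stride)
{
    for (size_t row = 0; row < height; row++) {
        memcpy(local, ext, width);  /* transfer one row of the 2-D block       */
        local += width;             /* local buffer is packed                  */
        ext   += ext_stride;        /* skip to the next row in external memory */
    }
}

int main(void)
{
    static uint8_t frame[64 * 64];  /* stand-in for external memory */
    static uint8_t tile[16 * 16];   /* stand-in for local memory    */
    /* Pull the 16x16 tile whose top-left corner is at (row 8, col 8). */
    dma_2d_copy(tile, &frame[8 * 64 + 8], 16, 16, 64);
    return 0;
}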

  44. LMDMA Double Buffer Mode • [diagram: the core pipeline computes on Local Memory Bank 0 while the DMA engine moves data between Local Memory Bank 1 and external memory; the banks are then switched between the core and the DMA engine; width/byte stride (in the DMA Setup register) = 1]
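
A hedged sketch of the ping-pong scheme the diagram shows: compute on one bank while the DMA engine fills the other, then switch. dma_start/dma_wait/process are stand-ins for the real DMA driver and application kernel, stubbed out so the sketch compiles.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BANK_SIZE 4096
static uint8_t bank[2][BANK_SIZE];                   /* two local-memory banks */

/* Stubs standing in for the real DMA driver and compute kernel. */
static void dma_start(uint8_t *dst, size_t n) { memset(dst, 0, n); }
static void dma_wait(void)                    { }
static void process(const uint8_t *b, size_t n) { (void)b; (void)n; }

int main(void)
{
    size_t num_blocks = 8;
    int compute = 0;                          /* bank the core works on    */
    dma_start(bank[compute], BANK_SIZE);      /* prefetch the first block  */
    dma_wait();
    for (size_t i = 0; i < num_blocks; i++) {
        int fill = 1 - compute;               /* bank the DMA engine fills */
        if (i + 1 < num_blocks)
            dma_start(bank[fill], BANK_SIZE); /* overlap the next transfer */
        process(bank[compute], BANK_SIZE);    /* ...with this computation  */
        if (i + 1 < num_blocks)
            dma_wait();
        compute = fill;                       /* bank switch               */
    }
    return 0;
}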

  45. Bus Interface Unit (BIU)

  46. N1213-S Block diagram

  47. N1213-S BUS • AMBA 2.0 AHB bus • 1 port • 2 port • ICU/MMU (read only) on port 1 • LSU/DMA/EDM (read/write) on port 2 • HSMP • High speed memory port • Runs at the same frequency as the CPU core • AMBA 3.0 (AXI) protocol compliant, but with reduced I/O requirements • 1 and 2 port configurations

  48. BIU introduction • The bus interface unit is responsible for off-CPU memory accesses, which include • System memory access • Instruction/data local memory access • Memory-mapped register access in devices

  49. Bus Interface • Compliance with AHB/AHB-Lite/APB • High Speed Memory Port • Andes Memory Interface • External LM Interface

  50. HSMP – High speed memory port • N12 also provides a high-speed memory port interface, which has higher bus protocol efficiency and can run at a higher frequency to connect to a memory controller. • The high-speed memory port is AMBA 3.0 (AXI) protocol compliant, but with reduced I/O requirements.
