arm introduction instruction set architecture l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
ARM Introduction & Instruction Set Architecture PowerPoint Presentation
Download Presentation
ARM Introduction & Instruction Set Architecture

Loading in 2 Seconds...

play fullscreen
1 / 71

ARM Introduction & Instruction Set Architecture - PowerPoint PPT Presentation


  • 621 Views
  • Uploaded on

ARM Introduction & Instruction Set Architecture. Aleksandar Milenkovic E-mail: milenka@ece.uah.edu Web: http://www.ece.uah.edu/~milenka. Outline. ARM Architecture ARM Organization and Implementation ARM Instruction Set Thumb Instruction Set Architectural Support for System Development

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'ARM Introduction & Instruction Set Architecture' - andrew


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
arm introduction instruction set architecture

ARMIntroduction &Instruction Set Architecture

Aleksandar Milenkovic

E-mail: milenka@ece.uah.edu

Web: http://www.ece.uah.edu/~milenka

outline
Outline
  • ARM Architecture
  • ARM Organization and Implementation
  • ARM Instruction Set
  • Thumb Instruction Set
  • Architectural Support for System Development
  • ARM Processor Cores
  • Memory Hierarchy
  • Architectural Support for Operating Systems
  • ARM CPU Cores
  • Embedded ARM Applications
arm history
ARM History
  • ARM – Acorn RISC Machine (1983 – 1985)
    • Acorn Computers Limited, Cambridge, England
  • ARM – Advanced RISC Machine 1990
    • ARM Limited, 1990
    • ARM has been licensed to many semiconductor manufacturers
arm s visible registers

r0

usable in user mode

r1

r2

r3

system modes only

r4

r5

r6

r7

r8_fiq

r8

r9_fiq

r9

r10_fiq

r10

r1

1_fiq

r1

1

r13_und

r12_fiq

r13_irq

r12

r13_abt

r13_svc

r14_und

r13_fiq

r14_irq

r13

r14_abt

r14_svc

r14_fiq

r14

r15 (PC)

SPSR_und

SPSR_irq

SPSR_abt

CPSR

SPSR_svc

SPSR_fiq

svc

abort

irq

undefi

ned

fiq

user mode

mode

mode

mode

mode

mode

ARM’s visible registers
  • User level
    • 15 GPRs, PC, CPSR (current program status register)
  • Remaining registers are used for system-level programming and for handling exceptions
arm cpsr format
ARM CPSR format
  • N (Negative), Z (Zero), C (Carry), V (oVerflow)
  • mode – control processor mode
  • T – control instruction set
    • T = 1 – instruction stream is 16-bit Thumb instructions
    • T = 0 – instruction stream is 32-bit ARM instructions
  • I F – interrupt enables
arm memory organization
ARM memory organization
  • Linear array of bytes numbered from 0 to 232 – 1
  • Data items
    • bytes (8 bits)
    • half-words (16 bits) – always aligned to 2-byte boundaries (start at an even byte address)
    • words (32 bits) – always aligned to 4-byte boundaries (start at a byte address which is multiple of 4)
arm instruction set
ARM instruction set
  • Load-store architecture
    • operands are in GPRs
    • load/store – only instructions that operate with memory
  • Instructions
    • Data Processing – use and change only register values
    • Data Transfer – copy memory values into registers (load) or copy register values into memory (store)
    • Control Flow
      • branch
      • branch-and-link – save return address to resume the original sequence
      • trapping into system code – supervisor calls
arm instruction set cont d
ARM instruction set (cont’d)
  • Three-address data processing instructions
  • Conditional execution of every instruction
  • Powerful load/store multiple register instructions
  • Ability to perform a general shift operation and a general ALU operation in a single instruction that executes in a single clock cycle
  • Open instruction set extension through coprocessor instruction set, including adding new registers and data types to the programmer’s model
  • Very dense 16-bit compressed representation of the instruction set in the Thumb architecture
i o system
I/O system
  • I/O is memory mapped
    • internal registers of peripherals (disk controllers, network interfaces, etc) are addressable locations within the ARM’s memory map and may be read and written using the load-store instructions
  • Peripherals may use either the normal interrupt (IRQ) or fast interrupt (FIQ) input
    • normally most interrupt sources share the IRQ input, while just one or two time-critical sources are connected to the FIQ input
  • Some systems may include external DMA hardware to handle high-bandwidth I/O traffic
arm exceptions
ARM exceptions
  • ARM supports a range of interrupts, traps, and supervisor calls – all are grouped under the general heading of exceptions
  • Handling exceptions
    • current state is saved by copying the PC into r14_exc and CPSR into SPSR_exc (exc stands for exception type)
    • processor operating mode is changed to the appropriate exception mode
    • PC is forced to a value between 0016 and 1C16, the particular value depending on the type of exception
    • instruction at the location PC is forced to (the vector address) usually contains a branch to the exception handler; the exception handler will use r13_exc, which is normally initialized to point to a dedicated stack in memory, to save some user registers
    • return: restore the user registers and then restore PC and CPSR atomically
arm cross development toolkit

C source

C libraries

asm source

C compiler

as

sembler

.aof

object

libraries

linker

.axf

debug

ARMsd

system model

development

ARMulator

board

ARM cross-development toolkit
  • Software development
    • tools developed by ARM Limited
    • public domain tools (ARM back end for gcc C compiler)
  • Cross-development
    • tools run on different architecture from one for which they produce code
outline12
Outline
  • ARM Architecture
  • ARM Assembly Language Programming
  • ARM Organization and Implementation
  • ARM Instruction Set
  • Architectural Support for High-level Languages
  • Thumb Instruction Set
  • Architectural Support for System Development
  • ARM Processor Cores
  • Memory Hierarchy
  • Architectural Support for Operating Systems
  • ARM CPU Cores
  • Embedded ARM Applications
arm instruction set13
ARM Instruction Set
  • Data Processing Instructions
  • Data Transfer Instructions
  • Control flow Instructions
data processing instructions
Data Processing Instructions
  • Classes of data processing instructions
    • Arithmetic operations
    • Bit-wise logical operations
    • Register-movement operations
    • Comparison operations
  • Operands: 32-bits wide;there are 3 ways to specify operands
    • come from registers
    • the second operand may be a constant (immediate)
    • shifted register operand
  • Result: 32-bits wide, placed in a register
    • long multiply produces a 64-bit result
data processing instructions cont d
Data Processing Instructions (cont’d)

Arithmetic Operations

Bit-wise Logical Operations

Register Movement

Comparison Operations

data processing instructions cont d16
Immediate operands:immediate = (0->255) x 22n, 0 <= n <= 12

Shifted register operands

the second operand is subject to a shift operation before it is combined with the first operand

Data Processing Instructions (cont’d)
arm shift operations
ARM shift operations
  • LSL – Logical Shift Left
  • LSR – Logical Shift Right
  • ASR – Arithmetic Shift Right
  • ROR – Rotate Right
  • RRX – Rotate Right Extended by 1 place
setting the condition codes
Setting the condition codes
  • Any DPI can set the condition codes (N, Z, V, and C)
    • for all DPIs except the comparison operations a specific request must be made
    • at the assembly language level this request is indicated by adding an `S` to the opcode
    • Example (r3-r2 := r1-r0 + r3-r2)
  • Arithmetic operations set all the flags (N, Z, C, and V)
  • Logical and move operations set N and Z
    • preserve V and either preserve C when there is no shift operation, or set C according to shift operation (fall off bit)
multiplies
Multiplies
  • Example (Multiply, Multiply-Accumulate)
  • Note
    • least significant 32-bits are placed in the result register, the rest are ignored
    • immediate second operand is not supported
    • result register must not be the same as the first source register
    • if `S` bit is set the V is preserved and the C is rendered meaningless
  • Example (r0 = r0 x 35)
    • ADD r0, r0, r0, LSL #2 ; r0’ = r0 x 5RSB r3, r3, r1 ; r0’’ = 7 x r0’
data transfer instructions
Data transfer instructions
  • Single register load and store instructions
    • transfer of a data item (byte, half-word, word) between ARM registers and memory
  • Multiple register load and store instructions
    • enable transfer of large quantities of data
    • used for procedure entry and exit, to save/restore workspace registers, to copy blocks of data around memory
  • Single register swap instructions
    • allow exchange between a register and memory in one instruction
    • used to implement semaphores to ensure mutual exclusion on accesses to shared data in multis
data transfer instructions cont d
Data Transfer Instructions (cont’d)

Register-indirect addressing

Single register load and store

Note: r1 keeps a word address (2 LSBs are 0)

Base+offset addressing (offset of up to 4Kbytes)

Note: no restrictions for r1

Auto-indexing addressing

Post-indexed addressing

data transfer instructions23
Block copy view

data is to be stored above or below the the address held in the base register

address incrementing or decrementing begins before or after storing the first value

Data Transfer Instructions

Multiple register data transfers

Note: any subset (or all) of the registers may be transferred with a single instruction

Note: the order of registers within the list is insignificant

Note: including r15 in the list will cause a change in the control flow

  • Stack organizations
    • FA – full ascending
    • EA – empty ascending
    • FD – full descending
    • ED – empty descending
multiple register transfer addressing modes
Multiple register transfer addressing modes

1018

1018

r9’

r9’

r5

16

16

r5

r1

r1

r0

r9

r0

100c

r9

100c

16

16

1000

1000

16

16

STMIA r9!, {r0,r1,r5}

STMIB r9!, {r0,r1,r5}

1018

1018

16

16

r9

r5

100c

r9

100c

16

16

r1

r5

r0

r1

1000

1000

r9’

r9’

r0

16

16

STMDA r9!, {r0,r1,r5}

STMDB r9!, {r0,r1,r5}

conditional execution
Conditional execution
  • Conditional execution to avoid branch instructions used to skip a small number of non-branch instructions
  • Example

With conditional execution

Note: add 2 –letter condition after the 3-letter opcode

branch and link instructions
Branch and link instructions
  • Branch to subroutine (r14 serves as a link register)
  • Nested subroutines
supervisor calls
Supervisor calls
  • Supervisor is a program which operates at a privileged level – it can do things that a user-level program cannot do directly
    • Example: send text to the display
  • ARM ISA includes SWI (SoftWare Interrupt)
jump tables
Jump tables
  • Call one of a set of subroutines depending on a value computed by the program

Note: slow when the list is long, and all subroutines are equally frequent

arm organization and implementation

ARMOrganization and Implementation

Aleksandar Milenkovic

E-mail: milenka@ece.uah.edu

Web: http://www.ece.uah.edu/~milenka

outline33
Outline
  • ARM Architecture
  • ARM Organization and Implementation
  • ARM Instruction Set
  • Architectural Support for High-level Languages
  • Thumb Instruction Set
  • Architectural Support for System Development
  • ARM Processor Cores
  • Memory Hierarchy
  • Architectural Support for Operating Systems
  • ARM CPU Cores
  • Embedded ARM Applications
arm organization
ARM organization

A[31:0]

control

address register

  • Register file –
    • 2 read ports, 1 write port + 1 read, 1 write port reserved for r15 (pc)
  • Barrel shifter – shift or rotate one operand for any number of bits
  • ALU – performs the arithmetic and logic functions required
  • Memory address register + incrementer
  • Memory data registers
  • Instruction decoder and associated control logic

P

incrementer

C

PC

register

bank

instruction

decode

A

multiply

&

L

register

U

control

A

B

b

u

b

b

s

u

u

barrel

s

s

shifter

ALU

data out register

data in register

D[31:0]

three stage pipeline
Three-stage pipeline
  • Fetch
    • the instruction is fetched from memory and placed in the instruction pipeline
  • Decode
    • the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
  • Execute
    • the instruction owns the datapath; the register bank is read, an operand shifted, the ALU register generated and written back into a destination register
arm single cycle instruction pipeline37

decode

execute add

fetch

decode

execute sub

fetch

decode

ARM single-cycle instruction pipeline

fetch

add r0,r1,#5

sub r2,r3,r6

execute cmp

cmp r2,#3

time

1

2

3

arm multi cycle instruction pipeline
ARM multi-cycle instruction pipeline

Decode logic is always generating the control signals for the datapath to use in the next cycle

arm multi cycle ldmia load multiple instruction
ARM multi-cycle LDMIA (load multiple) instruction

Decode stage occupied

since ldmia must continue to

remember decoded instruction

fetch

decode

ex ld r2

ex ld r3

ldmia

r0,{r2,r3}

sub r2,r3,r6

fetch

decode

ex sub

fetch

decode

ex cmp

cmp r2,#3

time

Instruction delayed

sub fetched at normal time but

not decoded until LDMIA is finishing

control stalls due to branches
Control stalls: due to branches
  • Branches often introduce stalls (branch penalty)
    • Stall time may depend on whether branch is taken
  • May have to squash instructions that already started executing
  • Don’t know what to fetch until condition is evaluated
arm pipelined branch

fetch

decode

ex bne

ex bne

ex bne

bne foo

sub

r2,r3,r6

fetch

decode

fetch

decode

ex add

foo add

r0,r1,r2

ARM pipelined branch

Decision not made until the third clock cycle

Two cycles of work thrown

away if bne takes place

time

pipeline how it works
Pipeline: how it works
  • All instructions occupy the datapath for one or more adjacent cycles
  • For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
  • During the fist datapath cycle each instruction issues a fetch for the next instruction but one
  • Branch instruction flush and refill the instruction pipeline
arm9tdmi 5 stage pipeline

next

pc

+

4

I-cache

fetch

pc + 4

pc

+

8

I decode

instruction

r15

decode

register read

immediate

fields

mul

LDM/

STM

post-

index

+4

reg

shift

shift

execute

pre-index

ALU

forwarding

paths

mux

B, BL

MOV pc

SUBS pc

byte repl.

buffer/

D-cache

load/store

data

address

rot/sgn ex

LDR pc

write-back

register write

ARM9TDMI 5-stage pipeline
  • Fetch
  • Decode
    • instruction is decoded
    • register operands read (3 read ports)
  • Execute
    • an operand is shifted and the ALU result generated, or
    • address is computed
  • Buffer/data
    • data memory is accessed (load, store)
  • Write-back
    • write to register file
arm9tdmi data forwarding

next

pc

+

4

I-cache

fetch

pc + 4

pc

+

8

I decode

instruction

r15

decode

register read

immediate

fields

mul

LDM/

STM

post-

index

+4

reg

shift

shift

execute

pre-index

ALU

forwarding

paths

mux

B, BL

MOV pc

SUBS pc

byte repl.

buffer/

D-cache

load/store

data

address

rot/sgn ex

LDR pc

write-back

register write

ARM9TDMI Data Forwarding

Data Forwarding

Stall?

arm9tdmi pc generation

next

pc

+

4

I-cache

fetch

pc + 4

pc

+

8

I decode

instruction

r15

decode

register read

immediate

fields

mul

LDM/

STM

post-

index

+4

reg

shift

shift

execute

pre-index

ALU

forwarding

paths

mux

B, BL

MOV pc

SUBS pc

byte repl.

buffer/

D-cache

load/store

data

address

rot/sgn ex

LDR pc

write-back

register write

ARM9TDMI PC generation
  • 3-stage pipeline
    • PC behavior: operands are read in execution stage r15 = PC + 8
  • 5-stage pipeline
    • operands are read in decode stage and r15 = PC + 4?
    • incompatibilities between 3-stage and 5-stage implementations => unacceptable
    • to avoid this 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs
data processing instruction datapath activity ex

address register

address register

increment

increment

Rd

PC

Rd

PC

registers

registers

Rn

Rm

Rn

mult

mult

as ins.

as ins.

as instruction

as instruction

[7:0]

data out

data in

i. pipe

data out

data in

i. pipe

Data processing instruction datapath activity (Ex)
  • Reg-Reg
    • Rd = Rn op Rm
    • r15 = AR + 4AR = AR + 4
  • Reg-Imm
    • Rd = Rn op Imm
    • r15 = AR + 4AR = AR + 4

(a) register – register operations

(b) register – immediate operations

str store register datapath activity ex1 ex2

address register

increment

Rn

PC

registers

Rd

mult

shifter

=

A

+ B /

A

- B

byte?

data in

i. pipe

STR (store register) datapath activity(Ex1, Ex2)
  • Compute address (Ex1)
    • AR = Rn op Disp
    • r15 = AR + 4
  • Store data (Ex2)
    • AR = PC
    • mem[AR] = Rd<x:y>
    • If autoindexing=>Rn = Rn +/- 4

address register

increment

PC

registers

Rn

mult

lsl #0

=

A

/

A

+ B /

A

- B

[11:0]

data out

data in

i. pipe

(b) 2nd cycle – store data & auto-index

(a) 1st cycle – compute address

the first two of three cycles of a branch instruction

address register

address register

increment

increment

R14

registers

registers

PC

PC

mult

mult

shifter

lsl #2

=

A

=

A

+ B

[23:0]

data out

data in

i. pipe

data out

data in

i. pipe

The first two (of three) cycles of a branch instruction
  • Compute target address
    • AR = PC + Disp,lsl #2
  • Save return address (if required)
    • r14 = PC
    • AR = AR + 4

Third cycle: do a small correction to the value stored in the link register in order that it points to directly at the instruction which follows the branch?

(b) 2nd cycle – save return address

(a) 1st cycle – compute branch target

arm implementation
ARM Implementation
  • Datapath
    • RTL (Register Transfer Level)
  • Control unit
    • FSM (Finite State Machine)
2 phase non overlapping clock scheme
2-phase non-overlapping clock scheme
  • Most ARMs do not operate on edge-sensitive registers
  • Instead the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal
  • Data movement is controlled by passing the data alternatively through latches which are open during phase 1 or latches during phase 2
arm datapath timing
ARM datapath timing
  • Register read
    • Register read buses – dynamic, precharged during phase 2
    • During phase 1 selected registers discharge the read buses which become valid early in phase 1
  • Shift operation
    • second operand passes through barrel shifter
  • ALU operation
    • ALU has input latches which are open in phase 1,allowing the operands to begin combining in ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
    • ALU processes the operands during the phase 2, producing the valid output towards the end of the phase
    • the result is latched in the destination register at the end of phase 2
arm datapath timing cont d
ARM datapath timing (cont’d)

Minimum Datapath Delay = Register read time + Shifter Delay + ALU Delay + Register write set-up time + Phase 2 to phase 1 non-overlap time

the original arm1 ripple carry adder
The original ARM1 ripple-carry adder
  • Carry logic: use CMOS AOI (And-Or-Invert) gate
  • Even bits use circuit show below
  • Odd bits use the dual circuit with inverted inputs and outputs and AND and OR gates swapped around
  • Worst case path:32 gates long
arm2 4 bit carry look ahead scheme
ARM2 4-bit carry look-ahead scheme
  • Carry Generate (G)Carry Propagate (P)
  • Cout[3] =Cin[0].P + G
  • Use AOI and alternate AND/OR gates
  • Worst case:8 gates long
the arm2 alu logic for one result bit
The ARM2 ALU logic for one result bit
  • ALU functions
    • data operations (add, sub, ...)
    • address computations for memory accesses
    • branch target computations
    • bit-wise logical operations
    • ...
the arm6 carry select adder scheme
The ARM6 carry-select adder scheme
  • Compute sums of various fields of the wordfor carry-in of zero and carry-in of one
  • Final result is selected by using the correct carry-in value to control a multiplexor

Note: Be careful! Fan-out on some of these gates is high so direct comparison with previous schemes is not applicable.

Worst case: O(log2[word width]) gates long

the arm6 alu organization
The ARM6 ALU organization
  • Not easy to merge the arithmetic and logic functions =>a separate logic unit runs in parallel with the adder, and multiplexor selects the output
arm9 carry arbitration encoding
ARM9 carry arbitration encoding
  • Carry arbitration adder
the cross bar switch barrel shifter

right 3

right 2

right 1

no shift

in[3]

left 1

in[2]

left 2

in[1]

left 3

in[0]

out[0]

out[1]

out[2]

out[3]

The cross-bar switch barrel shifter
  • Shifter delay is critical since it contributes directly to the datapath cycle time
  • Cross-bar switch matrix (32 x 32)
  • Principle for 4x4 matrix
the cross bar switch barrel shifter cont d
The cross-bar switch barrel shifter (cont’d)
  • Precharged logic is used => each switch is a single NMOS transistor
  • Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0 giving the zero filling required by the shift semantics
  • For rotate right, the right shift diagonal is enabled + complementary shift left diagonal (e. g., ‘right 1’ + ‘left 3’)
  • Arithmetic shift right:use sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
multiplier design
Multiplier design
  • All ARMs apart form the first prototype have included support for integer multiplication
    • older ARM cores include low-cost multiplication hardwarethat supports only the 32-bit result multiply and multiply-accumulate
    • recent ARM cores have high-performance multiplication hardware and support 64-bit result multiply andmultiply-accumulate
  • Low cost implementation
    • Use the datapath iteratively, employing the barrel shifterand ALU to generate 2-bit product in each clock cycle
    • use early termination to stop the iterations when there are no more ones in the multiply register
the 2 bit multiplication algorithm nth cycle
The 2-bit multiplication algorithm, Nth cycle
  • Control settings for the Nth cycle of the multiplication
  • Use existing shifter and ALU + additional hardware
    • dedicated two-bits-per-cycle shift register for the multiplier and a few gates for the Booth’s algorithm control logic(overhead is a few per cent on the area of ARM core)
high speed multiplication
High speed multiplication
  • Where multiplication performance is very important, more hardware resources must be dedicated
    • in some embedded systems the ARM core is used to perform real-time digital signal processing (DSP) – DSP programs are typically multiplication intensive
  • Use intermediate results which include partial sums and partial carries
    • Carry-save adders are used for this
  • These two binary results are added together at the end of multiplication
    • The main ALU is used for this
carry propagate a and carry save b adder structures
Carry-propagate (a) and carry-save (b) adder structures
  • Carry propagate adder takes two conventional (irredundant) binary numbers as inputs and produces a binary sum
  • Carry save adder takes one binary and one redundant (partial sum and partial carry) input and produces a sum in redundant binary representation (sum and carry)
arm high speed multiplier organization
ARM high-speed multiplier organization
  • CSA has 4 layers of adders each handling 2 multiplier bits=> multiply 8-bits per clock cycle
  • Partial sum and carry are cleared at the beginningor initialized to accumulate a value
  • Multiplier is shifted right 8-bitsper cycle in the ‘Rs’ register
  • Carry sum and carryare rotated right 8 bits per cycle
  • Performance: up to 4 clock cycles (early termination is possible)
  • Complexity: 160 bits in shift registers, 128 bits of carry-save adder logic (up to 10% of simpler cores)