
CprE 381 Computer Organization and Assembly Level Programming, Fall 2013

Chapter 2

Instructions: Language of the Computer

Zhao Zhang

Iowa State University

Revised from original slides provided by MKP

- MIPS procedure/function call convention
- Leaf and non-leaf examples
- Clearing array example
- String copy example
- Other issues:
- Load 32-bit immediate
- Assembler, loader, and compiler effects

§2.8 Supporting Procedures in Computer Hardware

Chapter 2 — Instructions: Language of the Computer — 2

- Exam 1 on Friday Oct. 4
- Course review on Wednesday Oct. 2
- HW4 is due on Sep. 27
- HW5 will be due on Oct. 11
- Do HW5 as exercise before Exam 1
- No HW and quizzes next week

- Lab 2 demo is due this week and Lab 3 demo due next week
- Lab 4 starts next week, due in one week

Chapter 1 — Computer Abstractions and Technology — 3

- Open book and open notes; calculators are allowed
- E-book reader is allowed
- Must be put in airplane mode

- Coverage
- Chapter 1, Computer Abstractions and Technology
- Chapter 2, Instructions: Language of the Computer
- Some contents from Appendix B
- MIPS floating-point instructions


- Short conceptual questions
- Calculation: speedup, power saving, CPI, etc.
- MIPS assembly programming
- Translate C statements to MIPS (arithmetic, load/store, branch and jump, others)
- Translate C functions to MIPS (call convention)

- Among others

Suggestions:

- Review slides and textbook
- Review homework and quizzes


Overview for Week 5, Sep. 23 - 27

- Bubble sorting example
- It will be used in Mini-Projects

- Floating point instructions
- ARM and x86 instruction set overview


- Bubble sort: Swap two adjacent elements if they are out of order
- Pass over the array n times; each pass floats the largest remaining element to the top
- Look at the first pass of five elements

1st try: 5 3 8 2 7 => 3 5 8 2 7
2nd try: 3 5 8 2 7 => 3 5 8 2 7
3rd try: 3 5 8 2 7 => 3 5 2 8 7
4th try: 3 5 2 8 7 => 3 5 2 7 8


- Pass i only has to make (n-i) comparisons
- In each pass, an element may float up until it meets a larger element
- The sorted sub-array grows by one element

1st pass: 5 3 8 2 7 => 3 5 2 7 8
2nd pass: 3 5 2 7 8 => 3 2 5 7 8
3rd pass: 3 2 5 7 8 => 2 3 5 7 8
4th pass: 2 3 5 7 8 => 2 3 5 7 8
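
The classic scheme traced above can be sketched in C as follows. This is a minimal illustration, not code from the slides: the name bubble_sort and the use of size_t are assumptions made here.

```c
#include <assert.h>
#include <stddef.h>

/* Classic bubble sort: pass number `pass` (counted from 1) compares the
 * first n - pass adjacent pairs, so the largest unsorted element ends up
 * at index n - pass. */
void bubble_sort(int v[], size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t pass = 1; pass < n; pass++) {
        for (size_t k = 0; k < n - pass; k++) {
            if (v[k] > v[k + 1]) {      /* adjacent pair out of order */
                int temp = v[k];
                v[k] = v[k + 1];
                v[k + 1] = temp;
            }
        }
    }
}
```

Calling bubble_sort on the five-element example takes four passes, each one comparison shorter than the one before.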


- The textbook bubble-sort is optimized to reduce comparisons
void sort (int v[], int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = i - 1; j >= 0 && v[j] > v[j+1]; j--)
            swap(v, j);
    }
}


- The classic one lets the largest element float to the top of the unsorted sub-array
- The revised one lets an element float to its right place in the sorted sub-array

1st pass: 5 3 8 2 7 => 3 5 8 2 7
2nd pass: 3 5 8 2 7 => 3 5 8 2 7
3rd pass: 3 5 8 2 7 => 2 3 5 8 7
4th pass: 2 3 5 8 7 => 2 3 5 7 8
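
To make the per-pass behavior of the revised sort checkable, here is a small C sketch. The sort and swap bodies follow the slides; the helper sink_left is a hypothetical name introduced here so that a single outer-loop iteration can be run on its own.

```c
#include <assert.h>

/* Swap v[k] and v[k+1], as in the slides' leaf function. */
static void swap(int v[], int k)
{
    int temp = v[k];
    v[k] = v[k + 1];
    v[k + 1] = temp;
}

/* Hypothetical helper (not in the slides): one outer-loop iteration.
 * v[0..i-1] is already sorted; v[i] moves left until the element
 * before it is not larger. */
void sink_left(int v[], int i)
{
    assert(i >= 0);
    for (int j = i - 1; j >= 0 && v[j] > v[j + 1]; j--)
        swap(v, j);
}

/* The textbook sort, expressed through the helper. */
void sort(int v[], int n)
{
    for (int i = 0; i < n; i++)
        sink_left(v, i);
}
```

Running sink_left with i = 1, 2, 3, 4 on 5 3 8 2 7 steps through the four passes listed above.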


- The swap function is a leaf function
void swap(int v[], int k)
{
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}

- v in $a0, k in $a1, temp in $t0

§2.13 A C Sort Example to Put It All Together


swap: sll $t1, $a1, 2 # $t1 = k * 4

add $t1, $a0, $t1 # $t1 = v+(k*4)

# (address of v[k])

lw $t0, 0($t1) # $t0 (temp) = v[k]

lw $t2, 4($t1) # $t2 = v[k+1]

sw $t2, 0($t1) # v[k] = $t2 (v[k+1])

sw $t0, 4($t1) # v[k+1] = $t0 (temp)

jr $ra # return to calling routine


for (i = 0; i < n; i++) {
    for (j = i - 1; j >= 0 && v[j] > v[j+1]; j--)
        swap(v, j);
}

- Save $ra to stack, as it’s a non-leaf function
- Assign i and j to $s0 and $s1
- They must be preserved when calling swap()

- Move v, n from $a0 and $a1 to $s2 and $s3
- They must be preserved, too
- $a0 and $a1 are used when calling swap()

- We need a stack frame of 5 words or 20 bytes


sort: addi $sp, $sp, -20 # make room on stack for 5 registers

sw $ra, 16($sp) # save $ra on stack

sw $s3,12($sp) # save $s3 on stack

sw $s2, 8($sp) # save $s2 on stack

sw $s1, 4($sp) # save $s1 on stack

sw $s0, 0($sp) # save $s0 on stack

… # procedure body

…

exit1: lw $s0, 0($sp) # restore $s0 from stack

lw $s1, 4($sp) # restore $s1 from stack

lw $s2, 8($sp) # restore $s2 from stack

lw $s3,12($sp) # restore $s3 from stack

lw $ra,16($sp) # restore $ra from stack

addi $sp,$sp, 20 # restore stack pointer

jr $ra # return to calling routine

- Entry: Get a frame, save $ra and $s3-$s0
- Exit: Restore $s0-$s3 and $ra, free the frame


A new pseudo instruction

move rd, rs

is equivalent to

add rd, rs, $zero

Example

move $s2, $a0 # $s2 = $a0

move $s3, $a1 # $s3 = $a1

No use of pseudo assembly instructions in Exam 1


# Move params
         move $s2, $a0          # save $a0 into $s2
         move $s3, $a1          # save $a1 into $s3
# Outer loop
         move $s0, $zero        # i = 0
for1tst: slt  $t0, $s0, $s3     # $t0 = 0 if $s0 ≥ $s3 (i ≥ n)
         beq  $t0, $zero, exit1 # go to exit1 if $s0 ≥ $s3 (i ≥ n)
# Inner loop
         addi $s1, $s0, -1      # j = i - 1
for2tst: slti $t0, $s1, 0       # $t0 = 1 if $s1 < 0 (j < 0)
         bne  $t0, $zero, exit2 # go to exit2 if $s1 < 0 (j < 0)
         sll  $t1, $s1, 2       # $t1 = j * 4
         add  $t2, $s2, $t1     # $t2 = v + (j * 4)
         lw   $t3, 0($t2)       # $t3 = v[j]
         lw   $t4, 4($t2)       # $t4 = v[j + 1]
         slt  $t0, $t4, $t3     # $t0 = 0 if $t4 ≥ $t3
         beq  $t0, $zero, exit2 # go to exit2 if $t4 ≥ $t3
# Pass params & call
         move $a0, $s2          # 1st param of swap is v (old $a0)
         move $a1, $s1          # 2nd param of swap is j
         jal  swap              # call swap procedure
# Inner loop
         addi $s1, $s1, -1      # j -= 1
         j    for2tst           # jump to test of inner loop
# Outer loop
exit2:   addi $s0, $s0, 1       # i += 1
         j    for1tst           # jump to test of outer loop


Old version:

void sort(int v[], int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        for (j = i - 1; j >= 0 && v[j] > v[j+1]; j--)
            swap(v, j);
    }
}

New version:

void sort(int v[], int n)
{
    int *pi, *pj;
    for (pi = v; pi < &v[n]; pi++)
        for (pj = pi - 1; pj >= v && swap(pj); pj--)
            {}
}


- A more efficient swap function that reduces memory loads
// swap two adjacent elements if they are
// out of order. Return 1 if swapped, 0
// otherwise
int swap(int *p)
{
    if (p[0] > p[1]) {
        int tmp = p[0];
        p[0] = p[1];
        p[1] = tmp;
        return 1;
    }
    else
        return 0;
}
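
For experimenting on a host machine, the pointer-based sort can be sketched as below. Two things are assumptions of this sketch rather than the slides' code: the functions are renamed (swap_if_unordered, sort_ptr), and the loops are restructured slightly so that no pointer before v is ever formed, since ISO C leaves that undefined even though the MIPS version handles it without trouble.

```c
#include <assert.h>
#include <stddef.h>

/* Swap p[0] and p[1] if they are out of order; return 1 if swapped. */
int swap_if_unordered(int *p)
{
    assert(p != NULL);
    if (p[0] > p[1]) {
        int tmp = p[0];
        p[0] = p[1];
        p[1] = tmp;
        return 1;
    }
    return 0;
}

/* Pointer-based sort. pi starts at v + 1 because the iteration with
 * pi == v does no work; the inner loop stops after handling pj == v
 * instead of decrementing past the start of the array. */
void sort_ptr(int v[], int n)
{
    if (n < 2)
        return;
    for (int *pi = v + 1; pi < v + n; pi++) {
        for (int *pj = pi - 1; swap_if_unordered(pj) && pj > v; pj--)
            ;
    }
}
```

Each element pair is loaded once inside swap_if_unordered, which both compares and swaps, so the caller no longer loads v[j] and v[j+1] before the call.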


- A new swap function
swap:

lw $t0, 0($a0) # load p[0]

lw $t1, 4($a0) # load p[1]

slt $t2, $t1, $t0 # p[1] < p[0]?

beq $t2, $zero, else # no? skip the swap

sw $t1, 0($a0) # swap

sw $t0, 4($a0) # swap

addi $v0, $zero, 1 # $v0 = 1

jr $ra

else:

addi $v0, $zero, 0 # $v0 = 0

jr $ra


The sort() function optimized

- Register usage
- $s0: v
- $s1: &v[n]
- $s2: pi
- $s3: pj

- Need a frame of 5 words to save $ra and $s0-$s3


sort:
    addi $sp, $sp, -20 # frame of 5 words
    sw $ra, 16($sp)
    sw $s3, 12($sp)
    sw $s2, 8($sp)
    sw $s1, 4($sp)
    sw $s0, 0($sp)

    … # MIPS code for sort function body

    lw $s0, 0($sp)
    lw $s1, 4($sp)
    lw $s2, 8($sp)
    lw $s3, 12($sp)
    lw $ra, 16($sp)
    addi $sp, $sp, 20 # release frame
    jr $ra


for (pi = v; pi < &v[n]; pi++)
    for (pj = pi - 1; pj >= v && swap(pj); pj--)
        {}

    add $s0, $a0, $zero # $s0 = v
    sll $a1, $a1, 2 # $a1 = 4*n
    add $s1, $s0, $a1 # $s1 = &v[n]
    add $s2, $s0, $zero # pi = v
    j for1_tst
for1_loop:
    … # MIPS code for the inner loop
    addi $s2, $s2, 4 # pi++
for1_tst:
    slt $t0, $s2, $s1 # pi < &v[n]?
    bne $t0, $zero, for1_loop # yes? repeat


for (pj = pi - 1; pj >= v && swap(pj); pj--)
    {}

    addi $s3, $s2, -4 # pj = pi-1
    j for2_tst
for2_loop:
    addi $s3, $s3, -4 # pj--
for2_tst:
    slt $t0, $s3, $s0 # pj < v?
    bne $t0, $zero, for2_exit # yes? exit
    add $a0, $s3, $zero # $a0 = pj
    jal swap # swap(pj)
    bne $v0, $zero, for2_loop # ret 1? cont
for2_exit:


- You will use the sorting code to test your CPU design in the lab mini-projects
- Use the new sorting code
- The new code is more optimized
- It will simplify the debugging


Reading: Textbook Ch. 3.5 and pages B-71 to B-80

- FP hardware is coprocessor 1
- Adjunct processor that extends the ISA

- Separate FP registers
- 32 single-precision: $f0, $f1, … $f31
- Paired for double-precision: $f0/$f1, $f2/$f3, …
- Release 2 of MIPS ISA supports 32 × 64-bit FP reg’s

Chapter 3 — Arithmetic for Computers — 25

- FP instructions operate only on FP registers
- Programs generally don’t do integer ops on FP data, or vice versa
- More registers with minimal code-size impact


- FP load and store instructions
- lwc1, ldc1, swc1, sdc1
- e.g., ldc1 $f8, 32($sp)

- lwc1, swc1: Load/store single-precision
- ldc1, sdc1: Load/store double-precision


- Single-precision arithmetic
- add.s, sub.s, mul.s, div.s
- e.g., add.s $f0, $f1, $f6

- Double-precision arithmetic
- add.d, sub.d, mul.d, div.d
- e.g., mul.d $f4, $f4, $f6



- Single- and double-precision comparison
- c.xx.s, c.xx.d (xx is eq, lt, le, …)
- Sets or clears FP condition-code bit
- e.g. c.lt.s $f3, $f4

- Branch on FP condition code true or false
- bc1t, bc1f
- e.g., bc1t TargetLabel



- The first two FP parameters in registers
- 1st parameter in $f12 or $f12:$f13
- A double-precision parameter takes two registers

- 2nd FP parameter in $f14 or $f14:$f15
- Extra parameters on the stack

- $f0 stores single-precision FP return value
- $f0:$f1 stores double-precision FP return value
- $f0-$f19 are FP temporary registers
- $f20-$f31 are FP saved temporary registers


- C code:
float f2c (float fahr)

{ return ((5.0/9.0) * (fahr - 32.0));}

- fahr in $f12, result in $f0
- Assume literals in global memory space, e.g. const5 for 5.0, const9 for 9.0, and const32 for 32.0
- Can FP immediate be encoded in MIPS instructions?


- Compiled MIPS code:
f2c: lwc1  $f16, const5($gp)
     lwc1  $f18, const9($gp)
     div.s $f16, $f16, $f18
     lwc1  $f18, const32($gp)
     sub.s $f18, $f12, $f18
     mul.s $f0, $f16, $f18
     jr    $ra


extern float fahr, cel;

cel = f2c(fahr);

Assume fahr is at 100($gp), cel is at 104($gp)

lwc1 $f12, 100($gp) # load 1st param

jal f2c

swc1 $f0, 104($gp) # save ret val


double max(double x, double y)

{

return (x > y) ? x : y;

}

max:

c.lt.d $f14, $f12 # y < x?

bc1f else # if false, do else

mov.d $f0, $f12 # $f0:$f1 = x

jr $ra

else:

mov.d $f0, $f14 # $f0:$f1 = y

jr $ra


- How to call max?
- Assume a, b, c at 100($gp), 108($gp), and 116($gp)
extern double a, b, c;

c = max(a, b);

ldc1 $f12, 100($gp) # $f12:$f13 = a

ldc1 $f14, 108($gp) # $f14:$f15 = b

jal max

sdc1 $f0, 116($gp) # c = $f0:$f1



int search(double X[], int size, double value)
{
    for (int i = 0; i < size; i++)
        if (X[i] == value)
            return 1;
    return 0;
}

Note 1: There are integer and FP parameters, and the return value is integer

Note 2: A real program may search for a value within a range, e.g. [value - delta, value + delta]
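
The range search mentioned in Note 2 might look like the sketch below. The function name search_within and the extra delta parameter are assumptions of this illustration, not part of the slides.

```c
#include <assert.h>

/* Return 1 if some X[i] lies in [value - delta, value + delta], else 0.
 * Exact == on floating-point data is fragile, so a tolerance is used. */
int search_within(const double X[], int size, double value, double delta)
{
    assert(delta >= 0.0);
    for (int i = 0; i < size; i++) {
        if (X[i] >= value - delta && X[i] <= value + delta)
            return 1;
    }
    return 0;
}
```

With delta set to 0.0 this reduces to the exact-match search shown above.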


search:

add $t0, $zero, $zero # i = 0

j for_cond

for_loop:

sll $t1, $t0, 3 # $t1 = 8*i

add $t1, $a0, $t1 # $t1 = &X[i]

ldc1 $f2, 0($t1) # $f2:$f3 = X[i] (double, so ldc1)

c.eq.d $f2, $f12 # X[i] == value?

bc1f endif # if false, skip

addi $v0, $zero, 1 # $v0 = 1

jr $ra # return

endif:

addi $t0, $t0, 1 # i++

for_cond:

slt $t1, $t0, $a1 # i < size?

bne $t1, $zero, for_loop # repeat if true

add $v0, $zero, $zero # to return 0

jr $ra


- X = X + Y × Z
- All 32 × 32 matrices, 64-bit double-precision elements

- C code:
void mm (double x[][32], double y[][32], double z[][32])
{
    int i, j, k;
    for (i = 0; i != 32; i = i + 1)
        for (j = 0; j != 32; j = j + 1)
            for (k = 0; k != 32; k = k + 1)
                x[i][j] = x[i][j] + y[i][k] * z[k][j];
}

- Addresses of x, y, z in $a0, $a1, $a2, and i, j, k in $s0, $s1, $s2


- MIPS code:
      li    $t1, 32          # $t1 = 32 (row size/loop end)
      li    $s0, 0           # i = 0; initialize 1st for loop
L1:   li    $s1, 0           # j = 0; restart 2nd for loop
L2:   li    $s2, 0           # k = 0; restart 3rd for loop
      sll   $t2, $s0, 5      # $t2 = i * 32 (size of row of x)
      addu  $t2, $t2, $s1    # $t2 = i * size(row) + j
      sll   $t2, $t2, 3      # $t2 = byte offset of [i][j]
      addu  $t2, $a0, $t2    # $t2 = byte address of x[i][j]
      l.d   $f4, 0($t2)      # $f4 = 8 bytes of x[i][j]
L3:   sll   $t0, $s2, 5      # $t0 = k * 32 (size of row of z)
      addu  $t0, $t0, $s1    # $t0 = k * size(row) + j
      sll   $t0, $t0, 3      # $t0 = byte offset of [k][j]
      addu  $t0, $a2, $t0    # $t0 = byte address of z[k][j]
      l.d   $f16, 0($t0)     # $f16 = 8 bytes of z[k][j]
      sll   $t0, $s0, 5      # $t0 = i * 32 (size of row of y)
      addu  $t0, $t0, $s2    # $t0 = i * size(row) + k
      sll   $t0, $t0, 3      # $t0 = byte offset of [i][k]
      addu  $t0, $a1, $t0    # $t0 = byte address of y[i][k]
      l.d   $f18, 0($t0)     # $f18 = 8 bytes of y[i][k]
      mul.d $f16, $f18, $f16 # $f16 = y[i][k] * z[k][j]
      add.d $f4, $f4, $f16   # $f4 = x[i][j] + y[i][k] * z[k][j]
      addiu $s2, $s2, 1      # k = k + 1
      bne   $s2, $t1, L3     # if (k != 32) go to L3
      s.d   $f4, 0($t2)      # x[i][j] = $f4
      addiu $s1, $s1, 1      # j = j + 1
      bne   $s1, $t1, L2     # if (j != 32) go to L2
      addiu $s0, $s0, 1      # i = i + 1
      bne   $s0, $t1, L1     # if (i != 32) go to L1
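
The address arithmetic in the MIPS code can be mirrored in C by storing each matrix as a flat row-major array. This is only a sketch: mm_flat and the macro N are names assumed here, and the shift by 5 is valid only because the row length is 32.

```c
#include <assert.h>
#include <stddef.h>

#define N 32  /* matrix dimension used on the slides */

/* x = x + y * z on row-major N x N matrices stored as flat arrays.
 * Element [r][c] sits at index (r << 5) + c; scaling that index by 8
 * bytes per double corresponds to the "sll ..., 3" in the MIPS code. */
void mm_flat(double *x, const double *y, const double *z)
{
    assert(x != NULL && y != NULL && z != NULL);
    for (size_t i = 0; i != N; i++) {
        for (size_t j = 0; j != N; j++) {
            double acc = x[(i << 5) + j];   /* plays the role of $f4 */
            for (size_t k = 0; k != N; k++)
                acc += y[(i << 5) + k] * z[(k << 5) + j];
            x[(i << 5) + j] = acc;          /* one store per x[i][j] */
        }
    }
}
```

As in the MIPS version, x[i][j] is loaded once before the k loop and stored once after it, rather than on every iteration.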


- ARM: the most popular embedded core
- Similar basic set of instructions to MIPS

§2.16 Real Stuff: ARM Instructions


- Uses condition codes for result of an arithmetic/logical instruction
- Negative, zero, carry, overflow
- Compare instructions to set condition codes without keeping the result

- Each instruction can be conditional
- Top 4 bits of instruction word: condition value
- Can avoid branches over single instructions


- Evolution with backward compatibility
- 8080 (1974): 8-bit microprocessor
- Accumulator, plus 3 index-register pairs

- 8086 (1978): 16-bit extension to 8080
- Complex instruction set (CISC)

- 8087 (1980): floating-point coprocessor
- Adds FP instructions and register stack

- 80286 (1982): 24-bit addresses, MMU
- Segmented memory mapping and protection

- 80386 (1985): 32-bit extension (now IA-32)
- Additional addressing modes and operations
- Paged memory mapping as well as segments


§2.17 Real Stuff: x86 Instructions


- Further evolution…
- i486 (1989): pipelined, on-chip caches and FPU
- Compatible competitors: AMD, Cyrix, …

- Pentium (1993): superscalar, 64-bit datapath
- Later versions added MMX (Multi-Media eXtension) instructions
- The infamous FDIV bug

- Pentium Pro (1995), Pentium II (1997)
- New microarchitecture (see Colwell, The Pentium Chronicles)

- Pentium III (1999)
- Added SSE (Streaming SIMD Extensions) and associated registers

- Pentium 4 (2001)
- New microarchitecture
- Added SSE2 instructions



- And further…
- AMD64 (2003): extended architecture to 64 bits
- EM64T – Extended Memory 64 Technology (2004)
- AMD64 adopted by Intel (with refinements)
- Added SSE3 instructions

- Intel Core (2006)
- Added SSE4 instructions, virtual machine support

- AMD64 (announced 2007): SSE5 instructions
- Intel declined to follow, instead…

- Advanced Vector Extension (announced 2008)
- Longer SSE registers, more instructions

- If Intel didn’t extend with compatibility, its competitors would!
- Technical elegance ≠ market success


- Two operands per instruction

- Memory addressing modes
- Address in register
- Address = Rbase + displacement
- Address = Rbase + 2^scale × Rindex (scale = 0, 1, 2, or 3)
- Address = Rbase + 2^scale × Rindex + displacement
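
As a worked illustration of the most general mode in the list above, the effective address can be computed as in this C sketch. The function name and the 32-bit types are assumptions made here; the other modes follow by passing zero for the unused terms.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address = Rbase + 2^scale * Rindex + displacement,
 * with scale restricted to 0..3 (element sizes of 1, 2, 4, 8 bytes).
 * Unsigned arithmetic wraps modulo 2^32, like a 32-bit address adder. */
uint32_t effective_address(uint32_t rbase, uint32_t rindex,
                           unsigned scale, int32_t displacement)
{
    assert(scale <= 3);
    return rbase + (rindex << scale) + (uint32_t)displacement;
}
```

For example, a base of 0x1000, an index of 3, a scale of 2, and a displacement of 8 give 0x1000 + 12 + 8 = 0x1014.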


- Variable length encoding
- Postfix bytes specify addressing mode
- Prefix bytes modify operation
- Operand length, repetition, locking, …


- Complex instruction set makes implementation difficult
- Hardware translates instructions to simpler microoperations
- Simple instructions: 1–1
- Complex instructions: 1–many

- Microengine similar to RISC
- Market share makes this economically viable

- Comparable performance to RISC
- Compilers avoid complex instructions


- Fallacy: Powerful instruction => higher performance
- Fewer instructions required
- But complex instructions are hard to implement
- May slow down all instructions, including simple ones

- Compilers are good at making fast code from simple instructions

- Fallacy: Use assembly code for high performance
- But modern compilers are better at dealing with modern processors
- More lines of code => more errors and less productivity

§2.18 Fallacies and Pitfalls


- Fallacy: Backward compatibility => instruction set doesn’t change
- But they do accrete more instructions

x86 instruction set


- Sequential words are not at sequential addresses
- Increment by 4, not by 1!

- Keeping a pointer to an automatic variable after procedure returns
- e.g., passing pointer back via an argument
- Pointer becomes invalid when stack popped
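
One common way to avoid the second pitfall is to pass a value back through caller-owned storage instead of handing out the address of a local. The sketch below assumes a hypothetical function named largest; it is an illustration, not code from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Broken pattern (do not do this): storing &best through an argument.
 * The frame holding best is popped on return, so that pointer dangles.
 *
 * Safer sketch: copy the value into storage owned by the caller. */
void largest(const int v[], int n, int *out)
{
    assert(v != NULL && out != NULL && n > 0);
    int best = v[0];            /* automatic variable in this frame */
    for (int i = 1; i < n; i++) {
        if (v[i] > best)
            best = v[i];
    }
    *out = best;                /* pass the value back, not &best */
}
```

The caller declares an int of its own and passes its address, so the result stays valid after largest returns.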


- Design principles
1. Simplicity favors regularity
2. Smaller is faster
3. Make the common case fast
4. Good design demands good compromises

- Layers of software/hardware
- Compiler, assembler, hardware

- MIPS: typical of RISC ISAs
- c.f. x86

§2.19 Concluding Remarks


- Measure MIPS instruction executions in benchmark programs
- Consider making the common case fast
- Consider compromises
