
COMP 2003: Assembly Language and Digital Logic



  1. COMP 2003: Assembly Language and Digital Logic. Chapter 7: Computer Architecture. Notes by Neil Dickson

  2. About This Chapter • This chapter delves deeper into the computer to give an understanding of the issues regarding the CPU, RAM, and I/O. • Having an understanding of the underlying architecture helps with writing efficient software.

  3. Part 1 of 3: CPU Execution Pipelining and Beyond

  4. Execution Pipelining • Old systems: 1 instruction at a time. Less-old systems: multiple independent instructions at a time. [Diagram: the five pipeline stages (Fetch Instruction, Decode Instruction, Load Operand Values, Execute Operation, Store Results) overlapped across several instructions, so that while one instruction executes, the next is loading its operands, the one after that is decoding, and so on.]

  5. A Hardware View of Pipelining [Diagram: Instructions 1–7 queued up to flow through the Instruction-Fetching Circuitry, Instruction Decoder(s), Operand-Loading Circuitry, Execution Unit(s), and Results-Storing Circuitry.] Problem: What if Instruction 1 stores its result in eax (e.g. “mov eax,1”) and Instruction 2 needs to load eax (e.g. “add ebx,eax”)?

  6. Pipeline Dependencies [Diagram: the same pipeline as before.] Suppose Instruction 1 stores its result in eax and Instruction 2 needs to load eax. Instruction 2 must wait at the operand-loading stage until the result is stored. Problem: What about conditional jumps?
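
A minimal NASM-style sketch (my illustration, not from the slides) of such a read-after-write dependency, and one way a programmer or compiler can soften the stall by interleaving an unrelated instruction:

    ; Dependent pair: the add must wait for the mov's result.
    mov eax, 1        ; Instruction 1: writes eax
    add ebx, eax      ; Instruction 2: stalls until eax is available

    ; Interleaved version: an independent instruction fills the gap.
    mov eax, 1        ; writes eax
    inc ecx           ; does not touch eax, so it can proceed meanwhile
    add ebx, eax      ; eax is more likely to be ready by now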

  7. Branch Prediction • Suppose that Instruction 3 is a conditional jump (e.g. jc MyLabel) • The “operand” to load is the flags. • Its execution is to determine whether or not to jump (i.e. where to go next). • Its result is stored in the instruction pointer, eip. • What comes next is unknown until execution, so the CPU makes a prediction first and checks it in the execution stage.
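
As a hedged aside (my example, not from the slides): a branch that is hard to predict can sometimes be replaced with a conditional move, which has no prediction to get wrong:

    ; Branching version: a misprediction flushes the pipeline.
        cmp eax, ebx
        jc  MyLabel        ; skip the move if carry is set
        mov ecx, edx
    MyLabel:

    ; Branchless version using a conditional move (i686 and later).
        cmp    eax, ebx
        cmovnc ecx, edx    ; ecx = edx only if carry is clear; no jump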

  8. Branch Prediction and the Pipeline Suppose Instruction 3 is a conditional jump. [Diagram: Instruction 2 changed the flags, so Instruction 3 waits at the operand-loading stage while the CPU fetches and decodes predicted Instructions 4–6. Oh no! It turns out the CPU guessed wrong, so it clears the pipeline and starts fetching from the new eip.]

  9. Pipelining Pros/Cons • Pro: Only one set of each hardware component is needed (plus some hardware to manage the pipeline) • Pro: The initial concept was simple • Con: The programmer/compiler must try to eliminate dependencies, which can be tough; otherwise the code faces big performance penalties • Con: The actual hardware can get complicated • Note: CPUs are no longer short on die space, so the first Pro doesn’t matter much anymore
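
One common way to eliminate such dependencies (a sketch of mine, not from the slides) is to split a long dependency chain, e.g. summing an array with two independent accumulators:

    ; Sum ecx dwords at [esi] (ecx assumed even) using two accumulators.
        xor eax, eax          ; accumulator 1
        xor edx, edx          ; accumulator 2
    SumLoop:
        add eax, [esi]        ; these two adds are independent of
        add edx, [esi+4]      ; each other, so they can run concurrently
        add esi, 8
        sub ecx, 2
        jnz SumLoop
        add eax, edx          ; combine the partial sums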

  10. Beyond Pipelining • For jumps that are hard to predict, guess BOTH directions, and keep two copies of results based on the guess (e.g. 2 of each register) • Allow many instructions in at once (e.g. multiple decoders, multiple execution units, etc.) so that there’s a higher probability of more operations that can run concurrently • Vector instructions (multiple data together)
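
For the last point, a small sketch (mine; assumes SSE2 and 16-byte-aligned data) of a vector instruction operating on multiple data at once:

    ; Add four packed 32-bit integers in one instruction (SSE2).
    movdqa xmm0, [esi]     ; load 4 dwords from the first array
    paddd  xmm0, [edi]     ; element-wise add of 4 dwords from the second
    movdqa [esi], xmm0     ; store the 4 results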

  11. Intel Core i7 Execution Architecture [Diagram: instructions arrive from RAM through the L3 and L2 caches into a 32KB instruction cache; a 16-byte prefetch buffer, steered by branch prediction, feeds an initial (length) decoder and a queue of ≤18 instructions; 4 decoders split instructions into parts called “MicroOps”, which wait in a buffer of ≤128 MicroOps; several 128-bit execution units, with 2 copies of the registers, execute them, loading from and storing to memory through a 32KB data cache.]

  12. What About Multiple Cores? • What we’ve looked at so far is a single CPU core’s execution. • A CPU core is a copy of this functionality on the CPU die, so a quad-core CPU has 4 copies of everything shown (except larger caches). • Instead of trying to run multiple instructions from the same stream of code concurrently, as before, each core runs independently of any others (one thread on each)

  13. Confusion About Cores • “Cores” in GPUs and custom processors like the Cell are not independent, whereas cores in standard CPUs are, so this has led to great confusion and misunderstanding. • The operating system decides what instruction stream (thread) to run on each CPU core, and can periodically change this (thread scheduling) • These issues are not part of this course, but may be covered in a parallel computing or operating systems course.

  14. Part 2 of 3: Memory Caches and Virtual Memory

  15. Memory Caches • Caches are on-CPU copies of parts of RAM, kept to save time • A cache miss is when a cache is checked for a piece of memory that is not there • Larger caches have fewer misses but are slower, so modern CPUs have multiple levels of cache: • Memory Buffers (ignored here), L1 Cache, L2 Cache, L3 Cache, RAM • Under normal circumstances, the CPU only accesses memory through the cache

  16. Reading From Cache • want value of memory at location A • if A is not in L1 • if A is not in L2 • if A is not in L3 • L3 reads A from RAM • L2 reads A from L3 • L1 reads A from L2 • read A from L1 • Note: A is now in all levels of cache

  17. Writing to Cache • want to store value into memory at location A • write A into L1 • after time delay, L1 writes A into L2 • after time delay, L2 writes A into L3 • after time delay, L3 writes A into RAM • Note: the time delays could result in concurrency issues in multi-core CPUs, so write caching can get more complicated

  18. Caching Concerns • Randomly accessing memory causes many more cache misses than sequentially accessing memory or accessing relatively few locations • This is one reason quicksort is often not as quick in practice as mergesort • Writing a huge block of memory that won’t be read soon can cause cache misses later, since it fills up the caches with the written data • In assembly, there are special instructions to mark certain writes as not to be cached, avoiding this
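
For that last point, a minimal sketch (mine; assumes SSE2’s non-temporal store) of writing around the cache:

    ; Non-temporal store: write eax to [edi] without filling the cache.
    ; Useful when streaming out a large block that won't be re-read soon.
    movnti [edi], eax
    sfence               ; make the non-temporal writes globally visible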

  19. Paging • Paging, a.k.a. virtual memory mapping, is a feature of CPUs that allows the apparent rearrangement of physical memory blocks into one or more virtual memory spaces. • 3 main reasons for this: • Programs can be in separate memory spaces, so they don’t interfere with each other • The OS can give the illusion of more memory using the hard drive • The OS can prevent programs from messing up the system (accidentally or intentionally)

  20. Virtual Memory • With a modern OS, no memory accesses by a program directly access physical memory • Virtual addresses are mapped to physical addresses in 4KB or 2MB pages using page tables, set up by the OS.

  21. Page Tables [Diagram: a page table for Dude.exe and a page table for Sweet.exe, each mapping its own virtual page #s 0–7 onto physical page #s 0–F; the two programs’ pages are scattered and interleaved in physical memory, but each program sees a contiguous virtual address space.]
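
As a worked sketch (my example, not from the slides): with 4KB pages, a 32-bit virtual address splits into a 20-bit virtual page number and a 12-bit offset:

    ; Split the virtual address in eax into page number and offset.
    mov edx, eax
    shr edx, 12            ; edx = virtual page number (top 20 bits)
    and eax, 0xFFF         ; eax = offset within the page (low 12 bits)
    ; The CPU's paging hardware looks up edx in the page table to get a
    ; physical page number, then appends the 12-bit offset to it.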

  22. Part 3 of 3: I/O and Interrupts Just an Overview

  23. Common I/O Devices • Human Interface (what most people think of) • Keyboard, Mouse, Microphone, Speaker, Display, Webcam, etc. • Storage • Hard Drive, Optical Drive, USB Key, SD Card • Adapters • Network Card, Graphics Card • Timers (very important for software) • PITs, LAPIC Timers, CMOS Timer

  24. If There’s One Thing to Remember • I/O IS SLOW! • Bad Throughput: • Mechanical drives can transfer up to 127MB/s • Memory bus can transfer up to 30,517MB/s (or more for modern ones) • Very Bad Latency: • 10,000 RPM drive average latency: 3,000,000ns (half a rotation: 60s / 10,000 rotations / 2 = 3ms) • 1333MHz uncached memory average latency: 16ns

  25. I/O Terminology • I/O Ports or Memory-Mapped I/O? • Some devices are controlled through special “I/O ports” accessible with the “in” and “out” instructions. • Some devices make themselves controllable by occupying blocks of memory and intercepting any reads or writes to that memory instead of using “in” and “out”. This is memory-mapped I/O (not to be confused with Direct Memory Access (DMA), in which a device itself reads or writes RAM).
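
A small example (mine, not from the slides) of port I/O, polling the legacy 8042 PS/2 keyboard controller at ports 0x64 (status) and 0x60 (data):

    ; Poll the keyboard controller until a scancode is available.
    WaitKey:
        in   al, 0x64      ; read the status port
        test al, 1         ; bit 0 set = output buffer full
        jz   WaitKey
        in   al, 0x60      ; read the scancode from the data port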

  26. I/O Terminology • Programmed I/O or Interrupt-Driven I/O? • Programmed I/O is controlling a device’s “operation” step-by-step with the CPU • Interrupt-Driven I/O involves the CPU setting up some “operation” to be done by a device and getting “notified” by the device when the “operation” is done • Most I/O in a modern system is interrupt-driven

  27. Interrupts • Instead of continuously checking for keyboard or mouse input, the CPU can be notified of it when it happens • Instead of waiting idly for the hard drive to finish writing data, it can do other work and be notified when the write is done • Such a notification is called an I/O interrupt. • (There are also exception interrupts, e.g. for an integer division by zero.)

  28. Interrupts • When an interrupt occurs, the CPU stops what it was doing and calls a function specified by the OS to handle the interrupt. • This function is called an interrupt handler. • The interrupt handler deals with the I/O operation (e.g. saves a typed key) and returns, resuming whatever was interrupted. • Because interrupts can occur at any time, values on the stack below esp may change at any time.
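
A minimal sketch (mine, not from the slides) of why data below esp is unsafe:

    ; Unsafe: an interrupt handler may use the stack below esp
    ; at any moment, clobbering this value.
    mov [esp-4], eax

    ; Safe: allocate the space first, so the value sits above esp.
    sub esp, 4
    mov [esp], eax
    ; ... use the saved value ...
    add esp, 4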
