CS433: Computer System Organization Luddy Harrison Intel IA32 Architecture
History The x86 / IA32 family
8086 / 8088 (1978) • 16-bit registers • 16-bit external data bus (8086) • 8-bit external data bus (8088) • 20-bit address space via segment registers
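A minimal sketch of how a 16-bit segment and a 16-bit offset combine into a 20-bit address on the 8086. The function name `real_mode_address` is an illustrative choice, not something from the slides.

```c
#include <stdint.h>

/* Real-mode 8086 address formation: the 16-bit segment value is shifted
   left by 4 bits and added to the 16-bit offset. */
static uint32_t real_mode_address(uint16_t segment, uint16_t offset)
{
    uint32_t sum = ((uint32_t)segment << 4) + offset;
    /* The 8086 has only 20 address lines, so the sum wraps at 1 MB. */
    return sum & 0xFFFFFu;
}
```

Note that many segment:offset pairs name the same byte: 1234:0010 and 1235:0000 both refer to physical address 0x12350.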
Intel 286 (1982) • segment registers point to descriptor tables • descriptors have 24-bit segment addresses • segment swapping • protection • bounds checking on segments • read/execute/write checking • four privilege levels
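The checks listed above can be sketched as follows. This is a deliberately simplified model, not the hardware's actual descriptor layout: the struct `seg_descriptor`, its field names, and the function names are all hypothetical, and the privilege rule shown is only the basic "current level must be at least as privileged as the descriptor's level" test used for data accesses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, unpacked view of a 286-style segment descriptor.
   The real descriptor packs these fields into 8 bytes. */
struct seg_descriptor {
    uint32_t base;        /* 24-bit segment base address */
    uint16_t limit;       /* highest valid offset within the segment */
    bool     readable;
    bool     writable;
    bool     executable;
    uint8_t  dpl;         /* privilege level: 0 (most) .. 3 (least privileged) */
};

enum access_kind { ACCESS_READ, ACCESS_WRITE, ACCESS_EXECUTE };

/* Returns true when the access passes the bounds, privilege, and
   read/write/execute checks of this simplified model. */
static bool access_allowed(const struct seg_descriptor *d, uint16_t offset,
                           enum access_kind kind, uint8_t cpl)
{
    if (offset > d->limit)
        return false;                 /* bounds check */
    if (cpl > d->dpl)
        return false;                 /* numerically larger = less privileged */
    switch (kind) {
    case ACCESS_READ:    return d->readable;
    case ACCESS_WRITE:   return d->writable;
    case ACCESS_EXECUTE: return d->executable;
    }
    return false;
}

/* Physical address: 24-bit base plus 16-bit offset, kept to 24 bits. */
static uint32_t physical_address(const struct seg_descriptor *d, uint16_t offset)
{
    return (d->base + offset) & 0xFFFFFFu;
}
```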
Intel 386 (1985) • 32-bit registers (data and address) • virtual 8086 mode • 32-bit address bus • segmented memory model + flat memory model • paging with 4 KB pages • pipelined execution (decode + execution)
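With 4 KB pages, a 32-bit linear address splits into a 10-bit page-directory index, a 10-bit page-table index, and a 12-bit offset within the page. The sketch below only shows that split; the struct and function names are illustrative, and the table walk itself is omitted.

```c
#include <stdint.h>

/* The three fields of a 32-bit linear address under 4 KB paging. */
struct linear_parts {
    uint32_t dir_index;    /* bits 31..22: entry in the page directory */
    uint32_t table_index;  /* bits 21..12: entry in the page table */
    uint32_t page_offset;  /* bits 11..0:  byte within the 4 KB page */
};

static struct linear_parts split_linear(uint32_t linear)
{
    struct linear_parts p;
    p.dir_index   = linear >> 22;
    p.table_index = (linear >> 12) & 0x3FFu;
    p.page_offset = linear & 0xFFFu;
    return p;
}
```

For example, linear address 0x00403025 selects directory entry 1, table entry 3, and byte 0x025 within the page.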
Intel 486 (1989) • five-stage pipeline • 8 KB on-chip L1 cache • write-through • integrated x87 FPU • power management
Intel Pentium (1993) • two pipelines, u and v • superscalar execution • 8 KB data + 8 KB instruction on-chip L1 caches • write-back option in addition to write-through • branch prediction • burstable 64-bit external data bus • multiprocessor support • [second stepping: MMX]
Intel P6 (1995 – 1999) • Pentium Pro • Pentium II • Pentium II Xeon • Celeron • Pentium III • Pentium III Xeon
Pentium Pro • 3-way superscalar • out-of-order • more aggressive branch prediction • speculative execution • L1 + L2 cache on chip • 8K + 8K L1 • 256K L2
Pentium II • MMX (in P6 family) • 16K + 16K L1 caches • 256K, 512K, 1M L2 caches supported • improved power management
Pentium II Xeon • improved multiprocessor support • 4- and 8-way systems • 2 MB L2 cache on chip
Celeron • low-priced / reduced power market • 128K L2 cache • cheaper package (plastic)
Pentium III • Streaming SIMD Extensions (SSE) • 128-bit registers • floating point vector types
Pentium III Xeon • improved cache
Pentium 4 (2000) • return to Arabic numerals • NetBurst microarchitecture • SSE2 and SSE3
Pentium 4 Supporting Hyper-Threading Technology (2004) • marketing team abandons names in favor of entire sentences • Hyper-Threading is Simultaneous MultiThreading
Intel Xeon (2001-2004) • internal revolt against long name • recycled portion of old name(s) prevails • multiprocessor support • Was this the first Hyper-Threading IA32?
Intel Pentium M (2003) • The M is not a Roman Numeral • not “Pentium 1000” • refers to “Mobile” • low-power • integrated wireless support
Register Architecture The x86 / IA32 family
NetBurst Microarchitecture (Pentium 4) • deep branch prediction • dynamic dataflow analysis • instructions translated into a RISC-like form • these in turn are subject to out-of-order execution • speculative execution • up to 126 instructions in flight • up to 48 loads and 24 stores in pipeline • advanced branch predictor • 4K branch target buffer • execution trace cache stores decoded instructions • straightens code on the fly! • 8-way L2 cache • 64-byte cache line size • external bus capable of 6.4 GB per second
Front-End Pipeline • Prefetch • Fetch (on prefetch fail) • Decode into micro-operations • Generate microcode from complex operations • Delivers decoded instructions from execution trace cache • Branch prediction
IEEE 754 and IA32 • Kahan et al. specified the correct behavior of floating-point hardware in a documented standard known as IEEE 754 • The x86 was designed to do all “scratch” calculations using a small floating-point stack • the entries on the stack are 80-bit extended-precision numbers • Unfortunately, this does not correspond well to the semantics of C
Example of Semantic Difference Between Natural x86 Execution and C Semantics
double A, B, C, D, E, F, G;
// set B=D=F and C=E=G and let B*C be very close to 0 in
// extended precision, but exactly 0 in double precision.
...
// suppose we use the x86 FP stack to do this RHS:
A = B*C + D*E - F*G;   // this yields zero in double but non-zero
                       // in extended precision
...
assert (A == 0.0);
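The underflow at the heart of this example can be reproduced with a small sketch. It assumes `long double` maps to a format with a wider exponent range than `double` (true for the 80-bit x87 format, not guaranteed on every platform), which is why the helper `long_double_is_wider` is provided; all function names here are illustrative.

```c
#include <float.h>
#include <stdbool.h>

/* Product rounded to double. Storing through a volatile double forces the
   rounding even if the compiler evaluates the multiply in a wider register. */
static double product_as_double(double b, double c)
{
    volatile double p = b * c;
    return p;
}

/* Product kept in long double, standing in for an x87 stack entry. */
static long double product_as_extended(double b, double c)
{
    return (long double)b * (long double)c;
}

/* True when long double can represent smaller magnitudes than double. */
static bool long_double_is_wider(void)
{
    return LDBL_MIN_EXP < DBL_MIN_EXP;
}
```

With B = C = 1e-200, the exact product 1e-400 is below the smallest double, so `product_as_double` returns 0.0, while on a platform with 80-bit extended precision `product_as_extended` returns a tiny non-zero value. Whether the slide's `assert` fires therefore depends on where the compiler chooses to round.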
Disadvantages of Far Pointers? • To dereference, two moves are necessary (in the general case): • move to segment register • move to address register • To compare, two comparisons are necessary • How do we compare ≤? • To store/load we must do two stores/loads • What about register allocation? • What other primitive types in programming languages are sometimes multi-word types (physically)?
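One way to see the comparison problem is to model a far pointer as a two-word struct. The sketch below assumes real-mode addressing (segment shifted left by 4, plus offset) and ignores wraparound at 1 MB; `far_ptr` and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-word far pointer: one word per component. */
struct far_ptr {
    uint16_t segment;
    uint16_t offset;
};

/* Real-mode linear address named by a far pointer. */
static uint32_t far_linear(struct far_ptr p)
{
    return ((uint32_t)p.segment << 4) + p.offset;
}

/* Word-by-word equality: two comparisons, and it can report "different"
   for two pointers that name the same byte. */
static bool far_identical(struct far_ptr a, struct far_ptr b)
{
    return a.segment == b.segment && a.offset == b.offset;
}

/* Ordering requires normalizing both operands to linear addresses first. */
static bool far_le(struct far_ptr a, struct far_ptr b)
{
    return far_linear(a) <= far_linear(b);
}
```

For instance, 1234:0010 and 1235:0000 are not word-for-word identical, yet each compares less than or equal to the other because both name linear address 0x12350.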
BCD (Binary Coded Decimal) What is the idea behind BCD? What is this type trying to optimize?
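As one possible starting point for these questions, packed BCD stores one decimal digit per 4-bit nibble, so the digits of a number can be read straight out of its hex representation. The sketch below handles a single byte (values 0 to 99); the function names are illustrative, and inputs outside that range are not checked.

```c
#include <stdint.h>

/* Pack a value in 0..99 into one byte: tens digit in the high nibble,
   ones digit in the low nibble. */
static uint8_t to_packed_bcd(uint8_t value)
{
    return (uint8_t)(((value / 10) << 4) | (value % 10));
}

/* Recover the binary value from one packed-BCD byte. */
static uint8_t from_packed_bcd(uint8_t bcd)
{
    return (uint8_t)((bcd >> 4) * 10 + (bcd & 0x0F));
}
```

Decimal 42 becomes the byte 0x42. The encoding spends more bits than pure binary (a byte holds 100 values instead of 256) in exchange for trivial conversion to and from decimal digits.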