Christopher Foster Scott Thibaudeau Brian Cleary
Itanium – IA-64: Overview. • Development of the Parallel Processor • Success and Failure (Problems and Solutions) • Multiple Parallel Pipelines on a Single Die • Itanium is born! • Execution of Parallel Processing in IA-64 • 10 deep pipeline execution; 9 Parallel distribution sites • Current and future IA-64 code Development • The Memory Requirements and Specifications • Hierarchy: Registers, L1/L2/L3 Cache, Main Memory, HD • L1=Data; L2=Unified; L3=Off-Chip: Fully Associative • Latency Times • Full Memory Block Diagram Overview • System Management Bus (SM Bus) • Thermal System • EEPROM, PIROM • System Bus (IA-64 Bus Architecture) • Bandwidth • Parallel Processors in Parallel • SAC, SDC (Controls access to the bus)
History of Microprocessors: A Very Abridged Tour. • Beginning of time: Circa 1980 and before… • CISC and RISC Computers are all that exist. • MOS Technology 6502 lives in every house (Nintendo). • Ronald Reagan in office. • Middle Ages: Circa 1990 • Parallel Processing exists in white-papers. • IA-32 is in almost every desktop. • Vanilla Ice hits it big. • Current Day: Circa 2000 • Beowulf Clusters (Distributed Parallel Processing Networks) • Pentium breaks the GHz mark with IA-32. • Intel develops the IA-64 Architecture to support parallelism on die.
So what’s so good about Parallelism? In the ideal case, each doubling of the parallel paths cuts the execution time IN HALF! • This leads to incredible gains: • Productivity (Reduced Latency) • Wait times for compile/execute • Increased functionality in real-time processes • Reliability (Redundancy) • Multiple modules for each functional unit • Security (Locality) • All processors in one place (physically) • Encryption power increased • Scalability (Modular reuse)
But are there any disadvantages? • YES: • Memory Size/Latency • Branch Prediction • Independent Instructions
IA-64 Solves All of these problems: • Memory Size: 64 bit addressing | Huge Register File • Memory Latency: Multiple Layers of Cache • Branch Prediction: Hardware Solution • Independent Instructions: *New code classes* And with these problems out of the way…
The way is prepared for: Multiple Parallel Processes on a Single Die: Explicitly Parallel Instruction Computing (EPIC) • With resources made available, the Itanium is able to use multiple functional units for each process required. • This results in an incredible number of separate pipelined execution paths: • Integer Function Units (2) • Memory Units (2) • Branch Units (3) • Floating Point Units (2) • Total: 9 separate execution paths! Note: Though the focus is not on pipelining here, each unit has a 10-stage-deep pipeline.
Fetch/Distribution Procedures • 3 instructions per bundle × 2 bundles per clock = up to 6 instructions per clock. • M0, M1, I0, I1, F0, F1, B0, B1, B2: these are all execution pipelines. • M=Memory Units • F=Floating Point Units • I=Integer Units • B=Branch Units
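The fetch figures above (3 instructions per bundle, 2 bundles per clock, 9 named ports) can be illustrated with a small sketch. It is a deliberately simplified model, not Intel's actual dispersal logic: the `disperse` function, the bundle strings such as "MII", and the first-free-port policy are all assumptions made for illustration.

```python
# Issue ports named on the slide, grouped by unit type.
PORTS = {
    "M": ["M0", "M1"],
    "I": ["I0", "I1"],
    "F": ["F0", "F1"],
    "B": ["B0", "B1", "B2"],
}

BUNDLE_SIZE = 3        # instructions per bundle (from the slide)
BUNDLES_PER_CLOCK = 2  # bundles fetched per clock (from the slide)


def disperse(bundles):
    """Hand each instruction in up to two bundles to a free port of its type.

    `bundles` is a list of strings such as "MII" or "MFB", one letter per
    instruction slot. Returns (assignments, stalled); `stalled` lists the
    instructions that found no free port in this clock.
    Hypothetical simplification: real dispersal rules are more involved.
    """
    if len(bundles) > BUNDLES_PER_CLOCK:
        raise ValueError("at most two bundles per clock")
    # Copy the port lists so each call starts with every port free.
    free = {unit: list(ports) for unit, ports in PORTS.items()}
    assignments, stalled = [], []
    for b, bundle in enumerate(bundles):
        if len(bundle) != BUNDLE_SIZE:
            raise ValueError("a bundle holds exactly three instructions")
        for slot, unit in enumerate(bundle):
            if free[unit]:
                assignments.append((b, slot, free[unit].pop(0)))
            else:
                stalled.append((b, slot, unit))
    return assignments, stalled
```

Calling `disperse(["MII", "MFB"])` places all six instructions, while `disperse(["MII", "MII"])` leaves two integer instructions without a port, which is why the slide's six per clock is an upper bound rather than a guarantee.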
How do we write code for The Itanium? • *NEW Code Classes* • Allow programmer to specify specific function units for: • Loads, Arithmetic, Branch Ops, Logic Operations • Enable users to specify INDEPENDENT INSTRUCTIONS • Interpretation at OS Level: • Windows 64 (to be released as Windows XP64); • Linux-64, HP-UX, Modesto; • PAL Level interpretation • Possibility of Virtual Machine interface.
Is Itanium Fully Developed? No. • Some registers yet to be named and used. • Windows 64 not yet available. • Cost of processor/memory production still too high. And they haven’t written any books on the subject yet either. Moore’s Law: If we keep doubling, then we can expect IA-64 to be around half as long as IA-32. That’s about 5-7 years. That gives us at least 3 more.
Register File • 256 registers: 128 general and 128 floating point • General registers are 64 bits wide • Rotating registers
Memory Hierarchy • Level 1 Data Cache (L1-D) • Level 1 Instruction Cache (L1-I) • 16 KB, 4-way set associative with 32-byte lines • Level 2 Unified Cache (L2) • Level 3 Cache (L3) • Main Memory via the Front-Side Bus (FSB) • Maximum bandwidth of 2.1 GB/s. • Level 1 & Level 2 Data Translation Lookaside Buffers (L1/L2-DTLB) • Instruction Translation Cache (ITLB)
Level 1 Data Cache (L1-D) • 16 KB, 4-way set associative, write through, no write allocate, with 32-byte lines • Integer loads have 2-cycle latency • Floating Point loads bypass the L1 Data cache
Level 2 Unified Cache (L2) • 96 KB, 6-way set associative, write back and write allocate, with 64-byte lines • Integer loads have 6-cycle latency • Floating Point loads have 9-cycle latency
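The size, associativity, and line-length figures quoted for L1-D and L2 determine how many sets each cache has. The following is a minimal sketch of that arithmetic; the function name `cache_geometry` is hypothetical, and it assumes the slide's "Kb" means kilobytes and that the set count is a power of two.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Derive line count, set count, and address bit split for a cache."""
    lines = size_bytes // line_bytes          # total cache lines
    sets = lines // ways                      # lines are grouped into sets
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
    }


# Figures from the slides.
L1D = cache_geometry(16 * 1024, ways=4, line_bytes=32)
L2 = cache_geometry(96 * 1024, ways=6, line_bytes=64)
```

With these inputs L1-D works out to 128 sets and L2 to 256 sets, which shows why the unusual 96 KB size pairs naturally with 6-way associativity.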
L3 Cache (L3) • Off-chip • 2 MB or 4 MB package • Maximum bandwidth from L3 to L2 is 16 bytes times the core frequency • Integer loads have 21-cycle latency • Floating Point loads have 24-cycle latency So what?
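One way to answer "So what?" is to plug the quoted numbers into two small calculations: the peak L3-to-L2 bandwidth at a given core clock, and the average integer load latency for a given mix of hits. This is only a sketch; the function names, the 800 MHz clock in the usage note, and the assumption that every load missing L2 hits in L3 are illustrative and not from the slides.

```python
# Integer load latencies in cycles, as quoted on the cache slides.
INT_LOAD_LATENCY = {"L1": 2, "L2": 6, "L3": 21}


def l3_to_l2_bandwidth(core_hz):
    """Peak L3-to-L2 bandwidth in bytes per second: 16 bytes per core cycle."""
    return 16 * core_hz


def expected_int_load_latency(hit_l1, hit_l2):
    """Average integer load latency in cycles.

    hit_l1: fraction of loads that hit in L1-D.
    hit_l2: fraction of L1 misses that hit in L2.
    Assumption: every load that misses L2 hits in L3.
    """
    miss_l1 = 1.0 - hit_l1
    beyond_l1 = (hit_l2 * INT_LOAD_LATENCY["L2"]
                 + (1.0 - hit_l2) * INT_LOAD_LATENCY["L3"])
    return hit_l1 * INT_LOAD_LATENCY["L1"] + miss_l1 * beyond_l1
```

For example, at an assumed 800 MHz core clock `l3_to_l2_bandwidth(800_000_000)` gives 12.8 billion bytes per second, and varying the two hit fractions shows how quickly the 21-cycle L3 penalty dominates the average.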
L1 & L2 Data Translation Lookaside Buffer • 32 & 96 entries, respectively • Both fully associative • Both support page sizes of 4k, 8k, 16k, 64k, 256k, 1M, 4M, 16M, 64M, and 256M • Purges supported include all page sizes and 4G
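The entry counts and page sizes above together bound how much memory each DTLB can map without a miss (often called TLB reach). A minimal sketch of that calculation follows; `tlb_reach` and the `PAGE_SIZES` table are names introduced here for illustration, and the sketch assumes every entry maps a page of the same size.

```python
KIB = 1024
MIB = 1024 * KIB

# Page sizes listed on the slide, in bytes.
PAGE_SIZES = {
    "4K": 4 * KIB, "8K": 8 * KIB, "16K": 16 * KIB, "64K": 64 * KIB,
    "256K": 256 * KIB, "1M": MIB, "4M": 4 * MIB, "16M": 16 * MIB,
    "64M": 64 * MIB, "256M": 256 * MIB,
}

DTLB_ENTRIES = {"L1": 32, "L2": 96}  # entry counts from the slide


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every entry maps one page."""
    return entries * page_bytes


# Reach of each DTLB level for every supported page size.
REACH = {
    level: {name: tlb_reach(n, size) for name, size in PAGE_SIZES.items()}
    for level, n in DTLB_ENTRIES.items()
}
```

With 4K pages the L1 DTLB covers only 128 KiB, while with 256M pages the same 32 entries cover 8 GiB, which is the practical reason for supporting so many page sizes.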
Instruction Translation Cache (ITLB) • Single-level instruction TLB • 64 entries • Fully associative
IA-64 Thermal Specifications • What are the components? How does it work? • Internal thermal circuit w/ thermal sensing diode • How does it protect itself from overheating? • Comparison to THIGH • What happens when overheating occurs? • Thermal Alert Register tripped • To restore… • What exactly are the heat tolerances? What should be calculated? Any equations? • According to Intel…
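The protection scheme outlined above (sensor reading compared against THIGH, alert register tripped on overheating) can be sketched as a tiny state machine. Everything specific here is an assumption for illustration: the class name `ThermalMonitor`, the use of degrees Celsius, the greater-or-equal comparison, and the 90-degree threshold in the usage note. The restore step is left out because the slide does not spell it out.

```python
class ThermalMonitor:
    """Toy model of a thermal sensor with a latched alert flag."""

    def __init__(self, t_high_c):
        self.t_high_c = t_high_c  # hypothetical THIGH threshold, in Celsius
        self.alert = False        # stands in for the Thermal Alert Register

    def sample(self, temp_c):
        """Compare one reading against THIGH and latch the alert if reached."""
        if temp_c >= self.t_high_c:
            self.alert = True
        return self.alert
```

For instance, a monitor created with `ThermalMonitor(90.0)` stays quiet at 75.0, trips at 92.5, and remains tripped on later cooler readings, mirroring the idea that the alert persists until it is explicitly restored.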
IA-64 Thermal Specifications: The Processors • What about the AMD/P4/P3? • P4: Application Slows Down (Itanium inherits fundamental heat protection) • P3: Application Freezes • As for the AMD… • Video displaying above characteristics at end of presentation
IA-64 System Management Bus (w/ Thermal Sensor) • Why do we care about the PIROM and EEPROM? • EEPROM is a read/write memory block that enables vendors to specify methods/standards for how data is transferred on the data bus. • PIROM contains write-protected information regarding certain characteristics of the processor (e.g. frequency). • As for the thermal sensor, in conjunction with the above components, accurate temperature checking/regulation is achieved.
IA-64 System Management Bus: Data/Addressing Management • Packet Types (Read/Write) • Memory Units: current address read, random access read, sequential read, byte write, page write • Thermal Unit: write byte, read byte, send byte, receive byte, ARA • Addressing • Memory Units: “1010XXY2b” • Thermal Unit: “0011XXXZb” “1001XXXZb” “0101XXXZb”
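The thermal unit's address patterns can be read as 8-bit templates in which the leading digits are fixed and the letters are bits that vary. The sketch below checks a byte against such a template. Treating every letter as a don't-care bit is an assumption made here; the slide does not say what X and Z stand for, and the helper name `matches` is hypothetical.

```python
# Thermal unit address patterns as written on the slide.
THERMAL_PATTERNS = ["0011XXXZb", "1001XXXZb", "0101XXXZb"]


def matches(pattern, value):
    """Return True if the 8-bit `value` fits `pattern`.

    Digits '0' and '1' must match exactly; any other character is treated
    as a don't-care bit (an assumption, see the lead-in).
    """
    bits = pattern.rstrip("b")  # drop the trailing binary marker
    if len(bits) != 8:
        raise ValueError("expected an 8-bit pattern")
    for i, ch in enumerate(bits):
        bit = (value >> (7 - i)) & 1  # most significant bit first
        if ch in "01" and int(ch) != bit:
            return False
    return True


def is_thermal_address(value):
    """True if `value` fits any of the thermal unit patterns."""
    return any(matches(p, value) for p in THERMAL_PATTERNS)
```

Under this reading each pattern admits 16 byte values, so the three thermal patterns together cover 48 of the 256 possible address bytes.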
IA-64 Main Bus Architecture: Specifications • 64-bit bus running at 2.1 GB/s • Multiple Itaniums can be connected in parallel to the same bus (running at 266 MHz) • SAC: System Address Controller • SDC: System Data Controller • The above controllers route address or data information between the Itanium(s) and the memory unit (from multiple processors to a single bus line and vice versa)
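The 2.1 GB/s figure follows from the bus width and clock quoted above. A minimal check of that arithmetic, assuming one full-width transfer per 266 MHz cycle and decimal gigabytes (the function name `bus_bandwidth` is hypothetical):

```python
def bus_bandwidth(width_bits, transfers_per_second):
    """Peak bus bandwidth in bytes per second."""
    return (width_bits // 8) * transfers_per_second


# Figures from the slide: 64-bit bus at 266 MHz.
PEAK_BYTES_PER_SECOND = bus_bandwidth(64, 266_000_000)
PEAK_GB_PER_SECOND = PEAK_BYTES_PER_SECOND / 1e9
```

Eight bytes per transfer at 266 million transfers per second gives 2,128,000,000 bytes per second, which rounds to the quoted 2.1 GB/s. This is a peak figure shared by every processor on the bus.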
IA-64 Customer Feedback • What are journalists, customers saying? - “The heat generated from the Itanium can be compared to an EZ-Bake Oven…Intel is losing its foothold in the processor industry by relying on the archaic x86 architecture.” - “Upgrading a mission critical system is a daunting task, especially since there exists reliable 64-bit Unix Machines. Then there’s the code conversion problem…”