
MAMAS – Computer Structure 234267

MAMAS – Computer Structure 234267. Lecturers: Lihu Rappoport, Adi Yoaz. Some of the slides were taken from Avi Mendelson, Randi Katz, Patterson, Gabriel Loh. General Course Information. Grade: 20% exercise (mandatory; the exercise grade is binding), 80% final exam, no midterm exam. Course web site


Presentation Transcript


  1. MAMAS – Computer Structure 234267 • Lecturers: Lihu Rappoport, Adi Yoaz • Some of the slides were taken from Avi Mendelson, Randi Katz, Patterson, Gabriel Loh

  2. General Course Information • Grade • 20% Exercise (mandatory; the exercise grade is binding) • 80% Final exam • No midterm exam • Course web site • http://webcourse.cs.technion.ac.il/234267 • Foils will be on the web several days before the class

  3. Class Focus • CPU • Introduction: performance, instruction set (RISC vs. CISC) • Pipeline, hazards • Branch prediction • Out-of-order execution • Memory Hierarchy • Cache • Main memory • Virtual Memory • Advanced Topics • PC Architecture • Motherboard & chipset, DRAM, I/O, Disk, peripherals

  4. Computer System – Sandy Bridge • [Figure: block diagram of a Sandy Bridge based system. Processor side: cores, cache, GFX, system agent, memory controller with two DDRIII channels on the memory bus, PCI Express ×16 to an external graphics card, display link (HDMI). I/O controller side (South Bridge, PCH): PCI Express ×1, USB controller, two SATA controllers (DVD drive, hard disk), PCI, LAN adapter, sound card (speakers), floppy drive, keyboard, mouse, parallel port, serial port.]

  5. Architecture & Microarchitecture • Architecture: the processor features seen by the “user” • Instruction set, addressing modes, data width, … • Micro-architecture: the way a processor is implemented • Cache size and structure, number of execution units, … • Timing is considered uArch (though it is user visible) • Processors with different uArch can support the same Architecture

  6. Compatibility • Backward compatibility • New hardware can run existing software • Core2 Duo can run SW written for Pentium4, PentiumM, Pentium III, Pentium II, Pentium, 486, 386, 286 • Forward compatibility • New software can run on existing hardware • Example: new software written with SSE2™ runs on an older processor which does not support SSE2™ • Commonly supports one or two generations behind • Architecture independent SW • JIT – just in time compiler: Java and .NET • Binary translation

  7. Moore’s Law The number of transistors doubles every ~2 years
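
The doubling rule on slide 7 can be written as a one-line projection. The sketch below is only illustrative: the function name, the starting count, and the exact 2-year period are assumptions, not figures from the slides.

```python
def projected_transistors(start_count, years, doubling_period_years=2.0):
    """Project a transistor count forward, assuming it doubles every
    `doubling_period_years` years (the ~2 years quoted on the slide)."""
    return start_count * 2 ** (years / doubling_period_years)


# Hypothetical starting point: 1 million transistors, projected 10 years ahead.
# Ten years is five doubling periods, so the count grows by a factor of 2**5.
after_ten_years = projected_transistors(1_000_000, 10)
```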

  8. CPI – Cycles Per Instruction • CPUs work according to a clock signal • Clock cycle is measured in nsec ($10^{-9}$ of a second) • Clock frequency (= 1/clock cycle) is measured in GHz ($10^{9}$ cyc/sec) • Instruction Count (IC) • Total number of instructions executed in the program • CPI – Cycles Per Instruction • Average #cycles per instruction (in a given program): $CPI = \dfrac{\#\text{cycles required to execute the program}}{IC}$ • IPC (= 1/CPI): Instructions per cycle

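  9. Calculating the CPI of a Program • $IC_i$: #times an instruction of type $i$ is executed in the program • $IC$: #instructions executed in the program: $IC = \sum_i IC_i$ • $F_i$: relative frequency of instructions of type $i$: $F_i = IC_i / IC$ • $CPI_i$: #cycles to execute an instruction of type $i$ • e.g.: $CPI_{add} = 1$, $CPI_{mul} = 3$ • #cycles required to execute the entire program: $\#cyc = \sum_i CPI_i \times IC_i$ • CPI: $CPI = \dfrac{\#cyc}{IC} = \dfrac{\sum_i CPI_i \times IC_i}{IC} = \sum_i CPI_i \times F_i$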
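
The per-type quantities on slide 9 combine into a weighted average. The sketch below shows that calculation for a hypothetical instruction mix: the per-type cycle counts (add = 1, mul = 3) come from the slide, while the instruction counts and the function name are made up for illustration.

```python
def program_cpi(mix):
    """Average CPI of a program.

    `mix` maps an instruction type to a pair (IC_i, CPI_i):
    how many times that type executes, and its cycles per instruction.
    """
    total_ic = sum(ic for ic, _ in mix.values())
    total_cycles = sum(ic * cpi for ic, cpi in mix.values())
    return total_cycles / total_ic


# Hypothetical program: 800 adds at 1 cycle each, 200 multiplies at 3 cycles each.
example_mix = {"add": (800, 1), "mul": (200, 3)}
cpi = program_cpi(example_mix)  # (800*1 + 200*3) / 1000
ipc = 1 / cpi                   # instructions per cycle is the reciprocal
```

The same result follows from the relative frequencies: with F_add = 0.8 and F_mul = 0.2, the weighted sum is 0.8 × 1 + 0.2 × 3.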

  10. CPU Time • CPU Time: time required to execute a program • $\text{CPU Time} = IC \times CPI \times \text{clock cycle}$ • Our goal: minimize CPU Time • Minimize clock cycle: more GHz (process, circuit, uArch) • Minimize CPI: uArch (e.g.: more execution units) • Minimize IC: architecture (e.g.: SSE™)
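
As a quick numeric check of the CPU Time product on slide 10, the sketch below expresses the clock cycle as 1/frequency. The function name and the example numbers are assumptions chosen for illustration, not measurements.

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """CPU Time = IC x CPI x clock cycle, with clock cycle = 1 / clock_hz."""
    return instruction_count * cpi / clock_hz


# Hypothetical program: 2 billion instructions, CPI of 1.5, on a 3 GHz clock.
baseline = cpu_time_seconds(2_000_000_000, 1.5, 3e9)

# Lowering any one factor lowers CPU Time proportionally, e.g. CPI 1.5 -> 0.75.
improved_cpi = cpu_time_seconds(2_000_000_000, 0.75, 3e9)
```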

  11. Amdahl’s Law • Suppose enhancement E accelerates a fraction F ($Fraction_{enhanced}$) of the task by a factor S ($Speedup_{enhanced}$), and the remainder of the task is unaffected, then: $t'_{exe} = t_{exe} \times \left[ (1 - Fraction_{enhanced}) + \dfrac{Fraction_{enhanced}}{Speedup_{enhanced}} \right]$ • $Speedup_{overall} = \dfrac{t_{exe}}{t'_{exe}} = \dfrac{1}{(1 - Fraction_{enhanced}) + \dfrac{Fraction_{enhanced}}{Speedup_{enhanced}}}$

  12. Amdahl’s Law: Example • Floating point instructions improved to run at 2×, but only 10% of executed instructions are FP • $t'_{exe} = t_{exe} \times (0.9 + 0.1 / 2) = 0.95 \times t_{exe}$ • $Speedup_{overall} = \dfrac{1}{0.95} = 1.053$ • Corollary: Make The Common Case Fast
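
The floating point example on slide 12 can be reproduced with a few lines of code. This is a minimal sketch of Amdahl's Law; the function and variable names are illustrative choices, not part of the course material.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the execution time
    is accelerated by a factor of `speedup_enhanced`."""
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


# Slide example: 10% of the work is FP, and FP becomes 2x faster.
fp_speedup = amdahl_speedup(0.1, 2)

# Upper bound: even an infinitely fast FP unit leaves the other 90% untouched.
fp_limit = amdahl_speedup(0.1, float("inf"))
```

The second call illustrates the corollary: when the enhanced fraction is small, the overall gain stays small no matter how large the local speedup is.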

  13. Comparing Performance • Peak Performance • MIPS, MFLOPS • Often not useful: unachievable / unsustainable in practice • Benchmarks • Real applications, or representative parts of real apps • Targeted at the specific system usages • SPEC INT – integer applications • Data compression, C compiler, Perl interpreter, database system, chess-playing, text-processing, … • SPEC FP – floating point applications • Mostly important scientific applications • TPC Benchmarks • Measure transaction-processing throughput

  14. Evaluating Performance of future CPUs • Use a performance simulator to evaluate the performance of a new feature / algorithm • Models the uarch in great detail • Run hundreds of representative applications • Produce the performance S-curve • Sort the applications according to the IPC increase • Baseline (0) is the processor without the new feature • [Figures: a bad S-curve, with positive outliers and negative outliers; a good S-curve, with positive outliers and only small negative outliers]

  15. Instruction Set Design • The ISA is what the user / compiler sees • The HW implements the ISA • [Figure: the instruction set drawn as the layer between software and hardware]

  16. ISA Considerations • Reduce the IC to reduce execution time • E.g., a single vector instruction performs the work of multiple scalar instructions • Simple instructions → simpler HW implementation • Higher frequency, lower power, lower cost • Code size • Long instructions take more time to fetch • Longer instructions require a larger memory • Important in small devices, e.g., cell phones

  17. Architectural Consideration Example: Immediate data size • 1% of data values > 16 bits • 12 – 16 bits needed • [Figure: distribution of immediate data bits (x-axis: 0 to 15 bits; y-axis: 0% to 30%), plotted for the Int. average and the FP average]

  18. CISC Processors • CISC – Complex Instruction Set Computer • The idea: a high level machine language • Example: x86 • Characteristics • Many instruction types, with many addressing modes • Some of the instructions are complex • Execute complex tasks • Require many cycles • ALU operations directly on memory • Only a few registers, in many cases not orthogonal • Variable length instructions • Common instructions get short codes → save code length

  19. Top 10 x86 Instructions • Simple instructions dominate instruction frequency

      Rank | Instruction            | % of total executed
      1    | load                   | 22%
      2    | conditional branch     | 20%
      3    | compare                | 16%
      4    | store                  | 12%
      5    | add                    | 8%
      6    | and                    | 6%
      7    | sub                    | 5%
      8    | move register-register | 4%
      9    | call                   | 1%
      10   | return                 | 1%
           | Total                  | 96%

  20. CISC Drawbacks • Complex instructions and complex addressing modes complicate the processor → slow down the simple, common instructions → contradict Make The Common Case Fast • Not compiler friendly • Non-orthogonal registers • Unused complex addressing modes • Variable length instructions are a pain • Difficult to decode several instructions in parallel • As long as an instruction is not decoded, its length is unknown → unknown where the instruction ends, and where the next instruction starts • An instruction may cross a cache line or a page
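
The decoding problem on slide 20 can be illustrated with a toy model. The encoding below is hypothetical and is not x86: it simply assumes that the first byte of each instruction stores that instruction's length in bytes. It shows why the start of instruction N+1 depends on decoding instruction N, whereas fixed length instructions have boundaries known in advance.

```python
def variable_length_starts(code):
    """Find instruction start offsets in a toy variable length encoding.

    Assumption (not x86): the first byte of each instruction is its
    total length in bytes. Each boundary is only known after the
    previous instruction has been examined, so the scan is sequential.
    """
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        length = code[pc]
        if length == 0:
            raise ValueError("zero-length instruction in toy encoding")
        pc += length
    return starts


def fixed_length_starts(code_size, width=4):
    """With fixed length instructions every start offset is known up front,
    so several decoders can each be handed an instruction in parallel."""
    return list(range(0, code_size, width))


toy_code = bytes([2, 0, 3, 0, 0, 1])  # three instructions: 2, 3 and 1 bytes long
toy_starts = variable_length_starts(toy_code)
risc_starts = fixed_length_starts(12)
```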

  21. RISC Processors • RISC - Reduced Instruction Set Computer • The idea: simple instructions enable fast hardware • Characteristics • A small instruction set, with few instruction formats • Simple instructions that execute simple tasks • Most of them require a single cycle (with pipeline) • A few indexing methods • ALU operations on registers only • Memory is accessed using Load and Store instructions only • Many orthogonal registers • Three address machine: Add dst, src1, src2 • Fixed length instructions

  22. RISC Processors (Cont.) • Simple architecture → simple micro-architecture • Simple, small and fast control logic • Simpler to design and validate • Leaves space for large on-die caches • Shortens time-to-market • Using a smart compiler • Better pipeline usage • Better register allocation • Existing RISC processors are not “pure” RISC • e.g., support division which takes many cycles • Examples: MIPS™, Sparc™, Alpha™, Power™

  23. Compilers and ISA • Ease of compilation • Orthogonality: • no special registers • few special cases • all operand modes available with any data type or instruction type • Regularity: • no overloading for the meanings of instruction fields • streamlined • resource needs easily determined • Register Assignment is critical too • Easier if lots of registers

  24. CISC Is Dominant • The x86 architecture, which is a CISC architecture, dominates the processor market • A vast amount of existing software • Intel, AMD, Microsoft and others benefit from this • Intel and AMD invest heavily in building high performance x86 processors, despite the architectural disadvantage • Current x86 processors give the best cost/performance • CISC processors use arch ideas from the RISC world • Starting at Pentium II and K6, x86 processors translate CISC instructions into RISC-like operations internally • The inside core looks much like that of a RISC processor

  25. Software Specific Extensions • Extend arch to accelerate exec of specific apps • Example: SSE™ – Streaming SIMD Extensions • 128-bit packed (vector) / scalar single precision FP (4×32) • Introduced on Pentium® III in ’99 • 8 new 128-bit registers (XMM0 – XMM7) • Accelerates graphics, video, scientific calculations, … • Packed: all four 32-bit elements of the two 128-bit operands are added, giving x3+y3, x2+y2, x1+y1, x0+y0 • Scalar: only the lowest elements are added, giving y3, y2, y1, x0+y0
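
The difference between the packed and scalar forms on slide 25 can be modelled with plain lists. This sketch is not real SIMD code: it represents a 128-bit register as a list of four values with index 0 as the lowest element, and the function names are illustrative. Following the element labels on the slide, the scalar form passes the upper three elements of y through unchanged.

```python
def add_packed(x, y):
    """Packed add: all four 32-bit elements are added element-wise."""
    return [a + b for a, b in zip(x, y)]


def add_scalar(x, y):
    """Scalar add: only element 0 is added; elements 1..3 are copied from y,
    matching the y3, y2, y1, x0+y0 result shown on the slide."""
    return [x[0] + y[0]] + list(y[1:])


x = [1.0, 2.0, 3.0, 4.0]      # x0, x1, x2, x3
y = [10.0, 20.0, 30.0, 40.0]  # y0, y1, y2, y3

packed_result = add_packed(x, y)
scalar_result = add_scalar(x, y)
```

One packed instruction here does the work of four scalar adds, which is how such extensions reduce the instruction count.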
