CSCE 930 Advanced Computer Architecture


Presentation Transcript

  1. CSCE 930 Advanced Computer Architecture Lecture 1 Evaluate Computer Architectures Dr. Jun Wang

  2. Computer Architecture Trends

  3. Figure 1.1 (H&P): Growth in microprocessor performance, 35% per year

  4. Technology Trends • Smaller feature sizes: higher speed and density • Density has increased by a factor of 77

  5. Technology Trends • Larger chips • Trend is toward more RAM, less logic per chip • Historically 2x per generation; leveling off? • McKinley has large on-chip caches => larger wafers to reduce fabrication costs

  6. Moore’s Law • Number of transistors doubles every 18 months (amended to 24 months) • Combination of both greater density and larger chips

  7. Tech. Trends, contd. • More, faster, cheaper transistors have fed an application demand for higher performance • 1970s -- serial, 1-bit integer microprocessors • 1980s -- pipelined 32-bit RISC • ISA simplicity allows processor on chip • 1990s -- large, superscalar processors, even for CISC • 2000s -- multiprocessors on a chip

  8. Pipelining and Branch Prediction • Two basic ways of increasing performance • Pipelining [figure: five pipeline stages IF, ID, EX, ME, WB separated by latches and driven by a common clock] • Branch prediction: speculate on the branch outcome to avoid waiting

  9. Tech. Trend: memory sizes • Memories have grown very dense • Feeding application demand for large, complex software

  10. Tech. Trend: memory speeds • Main memory speeds have not kept up with processor speeds

  11. Memory Hierarchies • Gap between processor and memory performance has led to widespread use of memory hierarchies • 1960s no caches, no virtual memory • 1970s shared I & D-cache, 32-bit virtual memory • 1980s Split I- and D-caches • 1990s Two level caches, 64-bit virtual memory • 2000s Multi-level caches, both on and off-chip

  12. Memory Hierarchies [figure: memory hierarchy from small/fast to large/slow: Registers, L1 Cache, L2 Cache, L3 Cache, Main Memory; region labels: PROCESSOR, MEMORY SYSTEM]

  13. I/O a key system component • I/O has evolved into a major distinguishing feature of computer systems • 1960s: disk, tape, punch cards, tty; batch processing • 1970s: character oriented displays • 1980s: video displays, audio, increasing disk sizes, beginning networking • 1990s: 3D graphics; networking a fundamental element; high quality audio • 2000s: real-time video, immersion…

  14. I/O Systems • A hierarchy that divides bandwidth • Data rates (peak) • Memory: 100 MHz, 8 bytes wide => 800 MB/s • PCI: 33 MHz, 4 bytes wide => 132 MB/s • SCSI: “Ultra2” (40 MHz), “Wide” (2 bytes) => 80 MB/s [figure: I/O system block diagram with Proc, DRAM, Controller, Local Bus, Interface, High Speed I/O bus, Frame Buffer, Expansion Controller, Hard Drives, Monitor, LAN, Slow Speed I/O bus, Floppy, CD ROM]
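
The peak data rates on this slide follow from multiplying the bus clock by the bus width. A minimal sketch of that arithmetic, assuming one transfer per clock cycle (the function name `peak_bandwidth_mb_s` is made up for illustration):

```python
def peak_bandwidth_mb_s(clock_mhz: float, width_bytes: int) -> float:
    """Peak bandwidth in MB/s, assuming one transfer of width_bytes per clock cycle."""
    # MHz * bytes = millions of bytes per second = MB/s
    return clock_mhz * width_bytes


# (clock in MHz, width in bytes) as quoted on the slide
buses = {
    "Memory": (100, 8),
    "PCI": (33, 4),
    "SCSI Ultra2 Wide": (40, 2),
}

for name, (mhz, width) in buses.items():
    print(f"{name}: {peak_bandwidth_mb_s(mhz, width):.0f} MB/s (peak)")
```

These are upper bounds only; sustained rates are lower once arbitration and protocol overhead are taken into account.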

  15. Multiprocessors [figure: processors (P), caches (C), an interconnection network, and memories (M)] • Multiprocessors have been available for decades… • 1960s small MPs • 1970s small MPs • Dream of automatic parallelization • 1980s small MPs; emergence of servers • Dream of automatic parallelization • 1990s expanding MPs • Very large MPPs failed • Dream of automatic parallelization fading • 2000s wide-spread MPs; on-chip multithreading • Many applications have independent threads • Programmers write applications to be parallel in the first place

  16. Evaluating Computer Architectures

  17. Computation Science • Computation is synthetic • Many of the phenomena in the computing field are created by humans rather than occurring naturally in the physical world • Very different from the natural sciences • When one discovers a fact about nature, it is a contribution, no matter how small • Creating something new alone does not establish a contribution • Anyone can create something new in a synthetic field • Rather, one must show that the creation is better

  18. What Does “Better” Mean? • “Better” can mean many things • Solves a problem in less time (faster) • Solves a larger class of problems (more powerful) • Uses resources more efficiently (cheaper) • Is less prone to errors (more reliable) • Is easier to manage/program (lower human cost)

  19. Amdahl's Law • Defines the speedup that can be gained by using a special feature • Speedup due to enhancement E: $\mathrm{Speedup}(E) = \dfrac{\text{ExTime w/o E}}{\text{ExTime w/ E}} = \dfrac{\text{Performance w/ E}}{\text{Performance w/o E}}$ • To find the speedup that comes from some enhancement E, suppose that E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected

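  20. Amdahl’s Law
      $$\mathrm{ExTime}_{new} = \mathrm{ExTime}_{old} \times \left[ (1 - \mathrm{Fraction}_{enhanced}) + \frac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}} \right]$$
      $$\mathrm{Speedup}_{overall} = \frac{\mathrm{ExTime}_{old}}{\mathrm{ExTime}_{new}} = \frac{1}{(1 - \mathrm{Fraction}_{enhanced}) + \dfrac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}}}$$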
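
As an illustration of the formula on this slide, here is a small sketch that evaluates the overall speedup. The function name `amdahl_speedup` and the sample numbers (40% of the task sped up 10x) are assumptions chosen for the example, not values from the lecture.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction of the task is accelerated by a given factor."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # New execution time relative to the old one: unaffected part + accelerated part
    relative_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / relative_time


# Hypothetical example: 40% of the task runs 10x faster
print(amdahl_speedup(0.4, 10.0))

# Even with a very large speedup, the unaffected 60% bounds the result by 1 / 0.6
print(amdahl_speedup(0.4, 1e9))
```

The second call shows the usual lesson of the law: the unenhanced fraction caps the achievable speedup no matter how fast the enhanced part becomes.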

  21. The “better” property is not simply an observation • Rather, the research will postulate that a new idea • An architecture, algorithm, protocol, data structure, methodology, language, optimization or model, etc. • Will lead to a “better” result • Making the connection between the idea and the improvement is as important as quantifying the size of the improvement • The contribution is the idea, and is generally a component of a larger computational system.

  22. How to Evaluate Architecture Ideas • Measuring/observing/analyzing real systems • Accurate results • Need a working system » Too expensive to evaluate architecture/system ideas

  23. Analytic models • Fast & easy analysis of relations • $T_{program} = \mathrm{NumOfInst} \times \left( T_{cpu} + T_{m} \, (1 - \mathrm{Cache}_{hit}) \right)$ • Allows extrapolation to ridiculous parameters, e.g. thousands of processors • Sometimes infeasible to obtain accuracy (e.g. modeling caches) • To obtain reasonable accuracy, the models may become very complex (e.g. modeling of network contention) • Queuing theory is a commonly used technique
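
The analytic model on this slide can be evaluated directly. The sketch below is only an illustration: the function name `program_time` and all parameter values are assumed, and the model itself charges the memory penalty once per instruction, scaled by the miss rate.

```python
def program_time(num_inst: float, t_cpu: float, t_mem: float, cache_hit: float) -> float:
    """T_program = NumOfInst * (T_cpu + T_m * (1 - Cache_hit)), all times in seconds."""
    if not 0.0 <= cache_hit <= 1.0:
        raise ValueError("cache_hit must be between 0 and 1")
    return num_inst * (t_cpu + t_mem * (1.0 - cache_hit))


# Hypothetical parameters: 1e9 instructions, 1 ns per instruction,
# 100 ns memory access time, 95% cache hit rate
baseline = program_time(1e9, 1e-9, 100e-9, 0.95)
better_cache = program_time(1e9, 1e-9, 100e-9, 0.99)
print(f"95% hit rate: {baseline:.2f} s")
print(f"99% hit rate: {better_cache:.2f} s")
```

Sweeping one parameter while holding the others fixed, as in the two calls above, is the kind of quick "what if" analysis that makes analytic models attractive despite their limited accuracy.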

  24. Simulation • The most popular method in computer architecture and system research • Mimic the architecture/system using software • Very flexible: nearly unlimited evaluation • Prototyping of non-existing machines possible • Evaluation of design options (design space exploration) cheap & flexible • Requires some sort of validation • Can be VERY slow

  25. Tradeoff between accuracy and computational intensity • Low level of abstraction => slow (e.g. simulating at the level of gates) • High level of abstraction => fast (e.g. only simulating processor, cache and memory components) • The tradeoff may be intensified when modeling parallel architectures, as multiple processors need to be simulated

  26. Three Simulation Techniques • Profile-based static modeling • Simplest and least costly • Use hardware counters on the chip or instrumented execution (such as Beowulf Linux cluster Pgprof, SGI perfex and Alpha ATOM) • Trace-driven • A more sophisticated technique • How it works (e.g. modeling memory system performance): • Collect traces generated by ATOM • Trace format: instruction address executed, data address accessed • Build the memory hierarchy model • Feed the trace into the simulation model and analyze the results
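
The trace-driven steps listed on this slide can be sketched as follows. This is a toy illustration, not the tooling used in the course: the direct-mapped cache model, the class and function names, and the four-record trace are all assumptions, and a real trace would come from an instrumentation tool rather than a hand-written list.

```python
from typing import Iterable, List, Optional, Tuple

# One trace record: (instruction address executed, data address accessed or None)
TraceRecord = Tuple[int, Optional[int]]


class DirectMappedCache:
    """Toy direct-mapped cache model that tracks tags only (no data)."""

    def __init__(self, num_lines: int, line_size: int) -> None:
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags: List[Optional[int]] = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Look up an address, update the cache state, and return True on a hit."""
        block = address // self.line_size
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag  # miss: the new block replaces the old one
        self.misses += 1
        return False


def simulate(trace: Iterable[TraceRecord], cache: DirectMappedCache) -> float:
    """Feed the data addresses of a trace into the cache model; return the hit rate."""
    for _inst_addr, data_addr in trace:
        if data_addr is not None:
            cache.access(data_addr)
    total = cache.hits + cache.misses
    return cache.hits / total if total else 0.0


# Hypothetical hand-written trace
trace = [(0x400, 0x1000), (0x404, 0x1004), (0x408, 0x2000), (0x40C, 0x1008)]
cache = DirectMappedCache(num_lines=4, line_size=16)
print(f"hit rate: {simulate(trace, cache):.2f}")
```

In this trace, addresses 0x1000 and 0x2000 map to the same cache line, so the third access evicts the block that the fourth access needs. Conflict behavior of this kind is exactly what a memory hierarchy model is meant to expose.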

  27. 1. Compile: pgcc -Mprof=func prg.cc 2. Run the code to produce a profile data file called pgprof.out 3. View the execution profile: pgprof pgprof.out

  28. Using Perfex • Usage: perfex [-e num] [-y] program [program args] • -e num: count only event type num • -y: generate a “cost report” • Example: perfex -e 41 -13 -y a.out
      Event #   Event                       Events counted
      41        Floating point OP retired   25292884493
      13        L2 cache lines loaded       223490870
      Statistics: MFLOPS 29.175907 • Main memory to L2 bandwidth 8.249655 MB/s

  29. Execution-driven • The most accurate and most costly • Trace-driven simulation cannot capture the interaction between the memory system and the processor • Detailed simulation of the memory system and the processor pipeline is performed simultaneously by actually executing the program on top of a simulation framework such as Simics, SimOS or SimpleScalar

  30. Measuring by Means of Benchmarks • Micro-benchmarks (e.g. instruction latencies, file system throughput) • Application benchmarks: general system behavior (e.g. Spec2000 or SPLASH2) • Only limited evaluation possible (e.g. limited systems support for measurement) • The machine must be available • Benchmark Suites: Collection of kernels, real and benchmark programs, lessening the weakness of any one benchmark by the presence of others.

  31. Summarize Results • Weighted arithmetic mean execution time: $\sum_{i=1}^{n} W_i \times T_i$ • Sum the products of weighting factors and execution times to reflect the individual frequency of each workload • $W_i = \dfrac{1}{T_i \times \sum_{j=1}^{n} (1 / T_j)}$ • Geometric mean execution time: $\left( \prod_{i=1}^{n} T_i / N_i \right)^{1/n}$ • Normalize execution times to a reference machine and take the geometric mean of the normalized execution times • Used by SPEC
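
The two summary statistics on this slide can be computed as in the sketch below. The function names and the sample times are hypothetical, and the code reads N_i as the execution time of program i on the reference machine, which is an assumption based on the slide's wording.

```python
import math
from typing import List, Sequence


def equal_time_weights(times: Sequence[float]) -> List[float]:
    """W_i = 1 / (T_i * sum_j (1 / T_j)); the weights sum to 1."""
    inverse_sum = sum(1.0 / t for t in times)
    return [1.0 / (t * inverse_sum) for t in times]


def weighted_arithmetic_mean(times: Sequence[float], weights: Sequence[float]) -> float:
    """Sum of W_i * T_i over all programs."""
    return sum(w * t for w, t in zip(weights, times))


def geometric_mean_normalized(times: Sequence[float], reference: Sequence[float]) -> float:
    """(product of T_i / N_i) ** (1 / n), with N_i taken as the reference machine's time."""
    ratios = [t / n for t, n in zip(times, reference)]
    return math.prod(ratios) ** (1.0 / len(ratios))


# Hypothetical execution times (seconds) of two programs
times = [1.0, 4.0]
reference = [2.0, 2.0]

weights = equal_time_weights(times)
print("weights:", weights)
print("weighted arithmetic mean:", weighted_arithmetic_mean(times, weights))
print("geometric mean of normalized times:", geometric_mean_normalized(times, reference))
```

With these weights every program contributes the same weighted time (W_i * T_i is constant), so no single long-running program dominates the summary.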

  32. A Report Example (P&H figure 1.17)
