
Herbert G. Mayer, PSU CS

Status 10/9/2012

CS 201 Computer Systems Programming, Chapter 3: “Architecture Overview”

Syllabus

  • Computing History

  • Evolution of Microprocessor µP Performance

  • Processor Performance Growth

  • Key Architecture Messages

  • Code Sequences for Different Architectures

  • Dependencies, AKA Dependences

  • Score Board

  • References


Computing History

Before 1940

1643 Pascal’s Arithmetic Machine

About 1660 Leibniz Four Function Calculator

1710 -1750 Punched Cards by Bouchon, Falcon, Jacquard

1810 Babbage Difference Engine, unfinished; the first programmer ever was the poet Lord Byron’s daughter, Lady Ada Lovelace, after whom the language Ada was named

1835 Babbage Analytical Engine, also unfinished

1920 Hollerith Tabulating Machine to help with census in the USA


Computing History

Decade of 1940s

1939 – 1942 John Atanasoff built a programmable, electronic computer at Iowa State University

1936 - 1945 Konrad Zuse’s Z3 and Z4, early electro-mechanical computers based on relays; a colleague advised the use of vacuum tubes

1946 John von Neumann’s stored-program computer design

1946 Mauchly and Eckert complete ENIAC (Electronic Numerical Integrator and Computer) at the University of Pennsylvania, modeled in part on Atanasoff’s ideas; a 30-ton monster

1980s John Atanasoff finally receives official acknowledgment for his pioneering work


Computing History

Decade of the 1950s

UNIVAC uniprocessor, based on ENIAC, commercially viable; developed by John Mauchly and John Presper Eckert

Commercial systems sold by Remington Rand

Mark III computer

Decade of the 1960s

IBM’s 360 family co-developed with GE, Siemens, et al.

Transistor replaces vacuum tube

Burroughs stack machines, compete with GPR architectures

All still von Neumann architectures


Cache and VMM developed, first at Manchester University


Computing History

Decade of the 1970s

Birth of Microprocessor at Intel, see Gordon Moore

High-end mainframes, e.g. CDC 6000s, IBM 360 + 370 series

Architecture advances: Caches, VMM ubiquitous, since real memories were expensive

Intel 4004, Intel 8080, single-chip microprocessors

Programmable controllers

Minicomputers: PDP-11, HP 3000 16-bit computer

Height of Digital Equipment Corp. (DEC)

Birth of personal computers, which DEC misses


Computing History

Decade of the 1980s

Decrease of minicomputer use

32-bit computing even on minis

Architecture advances: superscalar, faster caches, larger caches

Multitude of Supercomputer manufacturers

Compiler complexity: trace-scheduling, VLIW

Workstations common: Apollo, HP, DEC’s Ken Olsen trying to catch up, Intergraph, Ardent, Sun, Three Rivers, Silicon Graphics, etc.


Computing History

Decade of the 1990s

Architecture advances: superscalar & pipelined, speculative execution, out-of-order (ooo) execution

Powerful desktops

End of mini-computer and of many super-computer manufacturers

Microprocessors as powerful as early supercomputers

Consolidation of many computer companies into a few large ones

End of Soviet Union marked the end of several supercomputer companies


Evolution of µP Performance (by: James C. Hoe @ CMU)


Processor Performance Growth

Moore’s Law --from Webopedia 8/27/2004:

“The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since it was invented. Moore predicted that this trend would continue for the foreseeable future.

In subsequent years, the pace slowed down a bit, but data density doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for at least another two decades.”

Others coin a more general law, stating that “the circuit density increases predictably over time.”


Processor Performance Growth

As of 2012, Moore’s Law has held true since ~1968.

Some Intel fellows believe that an end to Moore’s Law will be reached ~2018 due to physical limitations in the process of manufacturing transistors from semiconductor material.

This phenomenal growth is unknown in any other industry. For example, if doubling of performance could be achieved every 18 months, then by 2001 other industries would have achieved the following:

Cars would travel at 2,400,000 MPH and get 600,000 MPG

Air travel from LA to NYC would be at Mach 36,000, or take 0.5 seconds


Message 1: Memory is Slow

The inner core of the processor, the CPU or the µP, is getting faster at a steady rate

Access to memory is also getting faster over time, but at a slower rate. This rate differential has existed for quite some time, with the strange effect that fast processors have to rely on slow memories

It is not uncommon on an MP server that a processor has to wait >100 cycles before a memory access completes. On a multi-processor the bus protocol is more complex due to snooping, backing off, arbitration, etc., which is why the number of cycles to complete an access can grow so high


Message 1: Memory is Slow

Discarding conventional memory altogether, relying only on cache-like memories, is NOT an option, due to the price differential between cache and regular RAM

Another way of seeing this: Using solely reasonably-priced cache memories (say at <= 10 times the cost of regular memory) is not feasible: resulting address space would be too small

Almost all intellectual efforts in computer architecture focus on reducing the performance impact of fast processors accessing slow memories

All else seems easy compared to this fundamental problem!
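To put numbers on this, a common back-of-the-envelope model (not from the slides) is the average memory access time, AMAT = hit time + miss rate * miss penalty. A minimal C sketch, with assumed cycle costs (1-cycle hit, 100-cycle miss penalty), shows how even small miss rates inflate the average:

#include <stdio.h>

/* AMAT sketch: the hit and miss-penalty cycle counts are assumptions for
 * illustration, not figures from the slides.                             */
int main(void) {
    const double hit_time     = 1.0;    /* cycles for a cache hit (assumed)  */
    const double miss_penalty = 100.0;  /* cycles to reach memory (assumed)  */
    for (int pct = 1; pct <= 10; pct++) {
        double miss_rate = pct / 100.0;
        double amat = hit_time + miss_rate * miss_penalty;
        printf("miss rate %2d%%  ->  average access time %5.1f cycles\n",
               pct, amat);
    }
    return 0;
}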


Message 1: Memory is Slow





[Figure: processor vs. memory performance over time, per “Moore’s Law”; the processor/memory performance gap grows ~50% per year. Source: David Patterson, UC Berkeley]


Message 2: Events Tend to Cluster

A strange thing happens during program execution: Seemingly unrelated events tend to cluster

Memory accesses tend to concentrate the majority of their referenced addresses in a small portion of the total address space. Even if all of memory is eventually accessed, this phenomenon is observed over shorter periods of time: one memory access seems independent of another, yet both happen to fall onto the same page (or working set of pages)

We call this phenomenon Locality! Architects exploit locality to speed up memory access via Caches, and to increase the address range beyond physical memory via Virtual Memory Management. Distinguish spatial versus temporal locality
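A small sketch (mine, not from the slides) of spatial locality: the two functions below touch exactly the same matrix elements, but the row-major walk visits consecutive addresses and therefore reuses cache lines and pages, while the column-major walk keeps jumping across them. The matrix dimensions are arbitrary.

#include <stddef.h>

#define ROWS 1024
#define COLS 1024

static double m[ROWS][COLS];   /* C stores this row-major */

/* Row-major traversal: consecutive accesses are neighbors in memory, so
 * most of them fall into the same cache line / page: good spatial locality. */
double sum_row_major(void) {
    double s = 0.0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            s += m[r][c];
    return s;
}

/* Column-major traversal of the same data: each access jumps by a whole
 * row (COLS * sizeof(double) bytes), defeating spatial locality and
 * typically running noticeably slower.                                   */
double sum_col_major(void) {
    double s = 0.0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            s += m[r][c];
    return s;
}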


Message 2: Events Tend to Cluster

Similarly, hash functions tend to concentrate a disproportionately large number of keys onto a small number of table entries

Incoming search key (say, a C++ program identifier) is mapped into an index, but the next, completely unrelated key, happens to map onto the same index. In an extreme case, this may render a hash lookup slower than a sequential search

The programmer must watch out for the phenomenon of clustering, as it is undesirable in hashing!
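A minimal sketch (mine, not from the slides) of such clustering: hashing identifiers by the sum of their character codes sends every permutation of the same characters to the same bucket. The identifiers and the table size are made up for illustration.

#include <stddef.h>
#include <stdio.h>

#define TABLE_SIZE 16   /* deliberately small; assumed for illustration */

/* A poor hash: sum of character codes. Identifiers that are permutations
 * of each other collide in the same bucket.                              */
static unsigned bad_hash(const char *key) {
    unsigned h = 0;
    for (; *key; key++)
        h += (unsigned char)*key;
    return h % TABLE_SIZE;
}

int main(void) {
    const char *ids[] = { "tmp1", "tmp2", "1tmp", "2tmp", "ptm1", "mpt2" };
    for (size_t i = 0; i < sizeof ids / sizeof ids[0]; i++)
        printf("%-5s -> bucket %u\n", ids[i], bad_hash(ids[i]));
    /* "tmp1", "1tmp", "ptm1" all land in the same bucket: clustering. */
    return 0;
}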


Message 2: Events Tend to Cluster

Clustering happens in many diverse modules of the processor architecture. For example, when a data cache is used to speed up memory accesses by keeping a copy of frequently used data in a faster memory unit, it turns out that a small cache suffices

This is due to Data Locality (spatial and temporal): data that have been accessed recently will be accessed again in the near future, or at least data that live close by will be accessed in the near future

Thus they happen to reside in the same cache line. Architects do exploit this to speed up execution, while keeping the incremental cost for HW contained. Here clustering is a valuable phenomenon
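As a small illustration of why nearby data cluster into the same line (assuming a 64-byte cache line, a size the slides do not specify), the index of the line an address falls into changes only every 64 bytes, so one miss pulls in several neighboring array elements:

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64   /* bytes per cache line; an assumption for illustration */

/* Index of the cache line an address falls into: addresses that share a
 * line differ only below the LINE_SIZE boundary.                         */
static uintptr_t line_index(const void *p) {
    return (uintptr_t)p / LINE_SIZE;
}

int main(void) {
    double a[16];
    /* Consecutive doubles usually share a line, so after one miss the
     * next several accesses are nearly free.                             */
    for (int i = 0; i < 16; i++)
        printf("&a[%2d] -> cache line %llu\n", i,
               (unsigned long long)line_index(&a[i]));
    return 0;
}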


Message 3: Heat is Bad

Clocking a processor fast (e.g. > 3-5 GHz) increases performance and thus generally “is good”

Other performance parameters, such as memory access speed, peripheral access, etc. do not scale with the clock speed. Still, increasing the clock to a higher rate is desirable

This comes at the cost of higher current and thus more heat generated in the same physical space, the geometry (the real estate) of the silicon processor or chipset

But the silicon part acts like a conductor whose resistance drops as it gets warmer (a negative temperature coefficient, NTC). Since the power supply is a constant-current source, the lower resistance causes a lower voltage, shown as VDroop in the accompanying figure


Message 3: Heat is Bad

This in turn means the voltage must be increased artificially to sustain the clock rate, creating more heat and ultimately leading to self-destruction of the part

Great efforts are being made to increase the clock speed, requiring more voltage, while at the same time reducing heat generation. Current technologies include sleep states of the silicon part (processor as well as chipset) and Turbo Boost mode, to contain heat generation while boosting the clock speed just at the right time

It is fortunate that, to date, silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies. Otherwise CPUs would become larger, more expensive, and above all hotter.


Message 4: Resource Replication

Architects cannot increase clock speed beyond physical limitations

One cannot decrease the die size beyond evolving technology

Yet speed improvements are desired, and achieved

This conflict can partly be overcome with replicated resources! But careful!


Message 4: Resource Replication

The key obstacle to parallel execution is data dependence in the SW under execution: a datum cannot be used before it has been computed

Compiler optimization technology calls this use-def dependence (short for use-definition dependence), AKA true dependence, AKA data dependence

Goal is to search for program portions that are independent of one another. This can be at multiple levels of focus:


Message 4: Resource Replication

At the very low level of registers, at the machine level – done by HW

At the low level of individual machine instructions – done by HW

At the medium level of subexpressions in a program – done by compiler

At the higher level of distinct statements in a high-level program – done by optimizing compiler or by programmer

Or at the very high level of different applications, running on the same computer, but with independent data, separate computations, and independent results – done by the user


Message 4: Resource Replication

Whenever program portions are independent of one another, they can be computed at the same time: in parallel

Architects provide resources for this parallelism

Compilers need to uncover opportunities for parallelism

If two actions are independent of one another, they can be computed simultaneously

Provided that HW resources exist, that the absence of dependence has been proven, and that the independent execution paths are scheduled on these replicated HW resources! Generally this is a complex undertaking!
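A tiny sketch (mine, not from the slides) of the distinction at the statement level: the first two assignments below are independent of each other and could run simultaneously on replicated resources, while the third carries a true dependence on both and must wait.

/* Independence vs. true dependence at the statement level -- illustrative only. */
int combine(int a, int b, int c, int d) {
    int x = a * b;   /* independent of y: may execute in parallel        */
    int y = c * d;   /* independent of x: may execute in parallel        */
    int z = x + y;   /* use-def (true) dependence on x and y: must wait  */
    return z;
}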


Code 1 for Different Architectures

Example 1: Object Code Sequence Without Optimization

Strict left-to-right translation, no smarts in mapping

Consider non-commutative subtraction and division operators

No common subexpression elimination (CSE), and no register reuse

Conventional operator precedence

For Single Accumulator SAA, Three-Address GPR, Stack Architectures

Sample source: d ← ( a + 3 ) * b - ( a + 3 ) / c
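As a rough C-level illustration (mine, not the slide’s own code table for the three architectures), a strictly left-to-right translation with one temporary per operation looks like this: the common subexpression ( a + 3 ) is computed twice, and the non-commutative subtraction and division keep their source operand order.

/* d = (a + 3) * b - (a + 3) / c, translated naively: one temporary per
 * operation, no common-subexpression elimination, no register reuse.    */
int d_unoptimized(int a, int b, int c) {
    int t1 = a + 3;    /* first (a + 3)                                   */
    int t2 = t1 * b;   /* (a + 3) * b                                     */
    int t3 = a + 3;    /* second (a + 3), recomputed: no CSE              */
    int t4 = t3 / c;   /* (a + 3) / c -- division is non-commutative      */
    return t2 - t4;    /* subtraction is non-commutative: order preserved */
}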


Code 1 for Different Architectures

Three-address code looks shortest, w.r.t. number of instructions

Maybe an optical illusion: must also consider the number of bits per instruction

Must consider number of I-fetches, operand fetches

Must consider total number of stores

Numerous memory accesses on SAA due to temporary values held in memory

Most memory accesses on the Stack Architecture, since everything requires a memory access

Three-Address architecture immune to commutativity constraint, since operands may be placed in registers in either order

Important architectural feature? Only if SW cannot handle this; compiler can

No need for reverse-operation opcodes for Three-Address architecture

Decide in Three-Address architecture how to encode operand types

Numerous stack instructions, i.e. many bits for opcodes, since each operand fetch is a separate instruction


Code 2 for Different Architectures

This time we eliminate the common subexpression

Compiler handles left-to-right order for non-commutative operators on SAA

Better code for: d = ( a+3 ) * b - ( a+3 ) / c
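Again at the C level (a sketch of the idea, not the slide’s code table), with the common subexpression computed once and reused:

/* d = (a + 3) * b - (a + 3) / c with common-subexpression elimination:
 * (a + 3) is evaluated once, saving one add and its operand fetches.    */
int d_cse(int a, int b, int c) {
    int t = a + 3;           /* the common subexpression, computed once */
    return t * b - t / c;
}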


Code 2 for Different Architectures

Single Accumulator Architecture (SAA) optimized still needs temporary storage; uses temp1 for common subexpression; has no other register!!

SAA could use negate instruction or reverse subtract

Register-use optimized for Three-Address architecture; but dup and xch are newly added instructions

Common subexpression optimized on the Stack Machine by duplicating, exchanging, etc.

20% reduction for Three-Address, 18% for SAA, only 8% for the Stack Machine


Code 3 for Different Architectures

Analyze similar source expressions but with reversed operator precedence

One operator sequence associates right-to-left, due to precedence

Compiler uses commutativity

The other left-to-right, due to explicit parentheses

Use simple-minded code model: no cache, no optimization

Will there be advantages/disadvantages due to architecture?

Expression 1 is: e ← a + b * c ^ d


Code 3 for Different Architectures

  • Expression 1 is: e ← a + b * c ^ d


Code 3 for Different Architectures

  • Expression 2 is: f ← ( ( g + h ) * i ) ^ j
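A C-level sketch of the two association orders (mine, not the slide’s code tables); on the slides ^ denotes exponentiation, rendered here with pow() since C has no exponentiation operator:

#include <math.h>

/* Expression 1: e = a + b * c ^ d.  Exponentiation binds tightest and
 * associates right-to-left, so c ^ d is evaluated first, then *, then +. */
double expr1(double a, double b, double c, double d) {
    double t1 = pow(c, d);   /* c ^ d       */
    double t2 = b * t1;      /* b * (c ^ d) */
    return a + t2;
}

/* Expression 2: f = ((g + h) * i) ^ j.  The explicit parentheses force
 * left-to-right evaluation before the final exponentiation.              */
double expr2(double g, double h, double i, double j) {
    double t1 = g + h;
    double t2 = t1 * i;
    return pow(t2, j);
}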


Code For Stack Architecture

A Stack Machine with no registers is inherently slow: memory accesses!!!

Implement a few top-of-stack elements via HW shadow registers → a cache

Measure equivalent code sequences with/without consideration for cache

Top-of-stack register tos points to last valid word on physical stack

Two shadow registers may hold 0, 1, or 2 true top words

Top of stack cache counter tcc specifies number of shadow registers in use

Thus tos plus tcc jointly specify true top of stack


Code For Stack Architecture

Timings for push, pushlit, add, pop operations depend on tcc

Operations in shadow registers fastest, typically 1 cycle, include register access and the operation itself

Generally, further memory access adds 2 cycles

For stack changes use some defined policy, e.g. keep tcc 50% full

Table below refines timings for stack with shadow registers

Note: push x into cache with free space requires 2 cycles: cache adjustment is done at the same time as memory fetch
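A small cycle-accounting sketch (mine; the cycle costs follow the statements above where given, and the spill policy is an assumption) of how the cost of push depends on whether a shadow register is free:

#include <stdio.h>

#define SHADOW 2           /* two shadow registers, as on the slides      */

static int  tcc    = 0;    /* shadow registers currently in use           */
static int  tos    = 0;    /* words on the in-memory stack                */
static long cycles = 0;

/* push of a variable: 2 cycles when a shadow register is free (the cache
 * adjustment overlaps the operand fetch); when the cache is full, one
 * shadow entry is first spilled to memory, a further memory access that
 * adds 2 cycles.                                                          */
static void push_var(void) {
    if (tcc < SHADOW) {
        tcc += 1;
        cycles += 2;
    } else {
        tos += 1;          /* spill the oldest shadow entry                */
        cycles += 2 + 2;
    }
}

int main(void) {
    for (int i = 0; i < 5; i++)    /* e.g. push a, b, c, d, e              */
        push_var();
    printf("5 pushes, %ld cycles (2 + 2 + 4 + 4 + 4)\n", cycles);
    return 0;
}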


Code For Stack Architecture

Code emission for: a + b * c ^ ( d + e * f ^ g )

Let + and * be commutative, by language rule

Architecture here has 2 shadow registers, compiler exploits this

Assume initially empty 2-word cache


Code For Stack Architecture


[Table: two emitted code sequences for a + b * c ^ ( d + e * f ^ g ), with per-instruction cycle counts. The left column pushes operands in source order (push a, push b, push c, push d, push e, push f, push g, ...); the right column reorders the operand fetches (push f, push g, push e, push d, push c, push b, push a, ...) and uses swap and expo so the operations complete in the two shadow registers.]

Code For Stack Architecture

Blind code emission, i.e. not taking advantage of the tcc knowledge, costs 40 cycles: it costs performance

Code emission with shadow register consideration costs 20 cycles

True penalty for memory access is worse in practice

A tremendous speed-up is always possible when fixing a system with severe flaws

The return on investment for two registers is twice the original performance

Such strong speedup is an indicator that the starting architecture was poor

Stack Machine can be fast, if purity of top-of-stack access is sacrificed for performance

Note that indexing, looping, indirection, call/return are not addressed here


Register Dependencies

Inter-instruction dependencies, also known as dependences, arise between registers being defined and used

One instruction computes a result into a register (or memory), another instruction needs that result from the register (or that memory location)

Or, one instruction uses a datum; only after this use may that same datum be recomputed


Register Dependencies

True Dependence, AKA Data Dependence:

r3 ← r1 op r2

r5 ← r3 op r4    Read after Write, RAW

Anti-Dependence, not a true dependence

can be parallelized under the right conditions

r3 ← r1 op r2

r1 ← r5 op r4    Write after Read, WAR

Output Dependence

r3 ← r1 op r2

r5 ← r3 op r4

r3 ← r6 op r7    Write after Write, WAW, use in between
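The same three dependence kinds expressed on plain C variables, as a compact sketch (mine, not from the slides):

/* RAW, WAR, and WAW on ordinary variables -- illustrative only. */
int deps(int r1, int r2, int r4, int r6, int r7) {
    int r3, r5;
    r3 = r1 + r2;   /* defines r3                                              */
    r5 = r3 + r4;   /* RAW (true dependence): reads the r3 just written        */
    r1 = r5 + r4;   /* WAR (anti-dependence): writes r1, which was read above  */
    r3 = r6 + r7;   /* WAW (output dependence): rewrites r3 after its use      */
    return r1 + r3 + r5;
}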


Register Dependencies

Control Dependence:

if ( condition1 ) {
    r3 = r1 op r2;
} else {              // see the jump here?
    r5 = r3 op r4;
} // end if
write( r3 );


Register Renaming

Only a true dependence is a real dependence, AKA a data dependence

Others are artifacts of insufficient resources, generally register resources

But that means that, if only more registers were available, replacing the conflicting registers with new ones would make the conflict disappear

Anti- and Output-Dependences are such false dependencies


Register Renaming

Original Dependences:

L1: r1 ← r2 op r3
L2: r4 ← r1 op r5
L3: r1 ← r3 op r6
L4: r3 ← r1 op r7

Renamed Situation, Dependences Gone:

L1: r10 ← r2 op r30   -- r30 has r3 copy
L2: r4 ← r10 op r5
L3: r1 ← r30 op r6
L4: r3 ← r1 op r7

The dependences before:

L1, L2 true-Dep with r1
L1, L3 output-Dep with r1
L1, L4 anti-Dep with r3
L3, L4 true-Dep with r1
L2, L3 anti-Dep with r1
L3, L4 anti-Dep with r3

The dependences after:

L1, L2 true-Dep with r10
L3, L4 true-Dep with r1


Register Renaming

With additional or renamed regs, the new code runs in half the time!

First: compute into r10 instead of r1, at no cost

Also: compute into r30, with no added copy operations, just more registers a priori

The registers live afterwards are: r1, r3, r4

While r10 and r30 are don’t cares


Score Board

The score board is an array of programmable bits sb[]

Manages HW resources, specifically registers

Single-bit array, any one bit associated with one specific register

Association by index, i.e. by name: sb[i] belongs to register ri

Only if sb[i] = 0 does register ri have valid data

If sb[i] = 0, then register ri is NOT in the process of being written

If bit i is set, i.e. if sb[i] = 1, then register ri has stale data

Initially all sb[*] are stale, i.e. set to 1


Score Board

Execution constraints:

rd ← rs op rt

If sb[s] or sb[t] is set → RAW dependence, hence stall the computation; wait until both sb[s] and sb[t] are 0

If sb[d] is set → WAW dependence, hence stall the write; wait until rd has been used; SW can sometimes choose to use another register instead of rd

Else dispatch the instruction immediately
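A minimal C sketch of the dispatch check (my own illustration of the rules above, not the slide’s HW design): sb[i] = 1 marks register ri as still being written; an instruction rd ← rs op rt may dispatch only when sb[s], sb[t], and sb[d] are all 0, and writeback clears sb[d].

#include <stdbool.h>
#include <stdio.h>

#define NUM_REGS 32

static bool sb[NUM_REGS];   /* sb[i] == 1: register ri holds stale data,
                               i.e. a write to ri is still in flight      */

/* Returns true and claims rd if "rd <- rs op rt" may dispatch now;
 * returns false if the instruction must stall.                           */
static bool try_dispatch(int d, int s, int t) {
    if (sb[s] || sb[t]) return false;   /* RAW: an operand is not ready   */
    if (sb[d])          return false;   /* WAW: earlier write to rd pends */
    sb[d] = true;                       /* rd is now being written        */
    return true;
}

/* Writeback: the value of rd has been computed, so clear sb[d]. */
static void writeback(int d) {
    sb[d] = false;
}

int main(void) {
    /* For simplicity this sketch starts with all registers valid.
     * r3 <- r1 op r2, then r5 <- r3 op r4: the second must wait for r3.  */
    if (try_dispatch(3, 1, 2))  puts("dispatched: r3 <- r1 op r2");
    if (!try_dispatch(5, 3, 4)) puts("stalled:    r5 <- r3 op r4 (RAW on r3)");
    writeback(3);
    if (try_dispatch(5, 3, 4))  puts("dispatched: r5 <- r3 op r4");
    return 0;
}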


Score Board

To allow out of order (ooo) execution, upon computing the value of rd

Update rd, and clear sb[d]

For uses (references), HW may use any register i, whose sb[i] is 0

For definitions (assignments), HW may set any register j, whose sb[j] is 0

Independent of the original order in which the source program was written, i.e. possibly ooo

References

  • The Humble Programmer: http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html

  • Algorithm Definitions: http://en.wikipedia.org/wiki/Algorithm_characterizations

  • http://en.wikipedia.org/wiki/Moore's_law

  • C. A. R. Hoare’s comment on readability: http://www.eecs.berkeley.edu/~necula/cs263/handouts/hoarehints.pdf

  • Gibbons, P. B., and Steven Muchnick [1986]. “Efficient Instruction Scheduling for a Pipelined Architecture”, ACM SIGPLAN Notices, Proceedings of the 1986 Symposium on Compiler Construction, Volume 21, Number 7, July 1986, pp. 11-16

  • Church-Turing Thesis: http://plato.stanford.edu/entries/church-turing/

  • Linux design: http://www.livinginternet.com/i/iw_unix_gnulinux.htm

  • Words of wisdom: http://www.cs.yale.edu/quotes.html

  • John von Neumann’s computer design: A.H. Taub (ed.), “Collected Works of John von Neumann”, vol 5, pp. 34-79, The MacMillan Co., New York 1963
